
scikit-learn style krige parameter optimisation #24

Merged: 24 commits merged into GeoStat-Framework:master on Dec 15, 2016
Conversation

@basaks (Collaborator) commented Dec 4, 2016

I am using this parameter optimisation in a project. Maybe someone else will benefit from it.

@rth (Contributor) left a review comment

Thanks a lot for your Pull Request! Adding a scikit-learn compatible API could definitely be interesting. A few comments on the code,

  1. First of all, we will probably not be able to add scikit-learn as a hard dependency (it's a massive library and this PR uses just a few classes from it), so this PR should work with or without scikit-learn installed. This means that:
    a. For defining the scikit-learn compatible estimator (where you need the BaseEstimator and RegressorMixin classes from scikit-learn), the solution could be something similar to what is done in xgboost or here; see the sketch at the end of this comment. Alternatively, this could be addressed in a subsequent PR, and just raising an ImportError when this module is imported without scikit-learn could be fine (together with raising a SkipTest in the unit tests).
    b. Everything else (Pipeline, GridSearchCV, etc.) should not be imported in PyKrige, but rather illustrated in a separate example.
  2. IMO PyKrige can just expose a scikit-learn compatible Kriging class; everything else (pipelining, cross-validation, any other form of pre-/post-processing) should be up to the user. In particular, this means that we could maybe just move pykrige/optimise/pipeline.py to e.g. examples/krige_cv.py.
  3. Why do you need to wrap the Kriging class in a pipeline? GridSearchCV should work directly on the Kriging class, shouldn't it?
  4. Regarding filenames, it might be best to:
    • move pykrige/optimise/krige.py to pykrige/sklearn.py (or sklearn_compat.py)

    • move pykrige/optimise/pipeline.py to examples/krige_cv.py (or any other appropriate name), and remove from it anything related to ConfigParser, saving to CSV (and, if possible, the pipeline), as that is a bit too specific; just printing the output should be fine

    • remove pykrige/optimise/README.md altogether. I think it would be better to a) add a section at the end of the README on how to run this example and b) link to http://scikit-learn.org/stable/modules/cross_validation.html from the example.

What do you think?
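
As a rough sketch of the optional-import idea in point 1a (the module name, flag name, and error message below are illustrative, not necessarily what this PR will end up with):

    # Hypothetical pykrige/compat.py
    try:
        from sklearn.base import BaseEstimator, RegressorMixin
        HAS_SKLEARN = True
    except ImportError:
        HAS_SKLEARN = False

        # Dummy base classes so that the wrapper module can still be
        # imported; using the wrapper should then fail with a clear error.
        class BaseEstimator(object):
            pass

        class RegressorMixin(object):
            pass


    def validate_sklearn():
        """Raise a helpful error if scikit-learn is not installed."""
        if not HAS_SKLEARN:
            raise ImportError(
                "scikit-learn is required for the sklearn-compatible API; "
                "install it or use the plain PyKrige classes instead."
            )

The wrapper's constructor (or the module defining it) would call validate_sklearn(), and the corresponding unit tests would raise unittest.SkipTest when HAS_SKLEARN is False.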

- python: "3.4"
env: DEPS="numpy=1.10.4 scipy=0.17 cython nose matplotlib"
env: DEPS="numpy=1.11.2 scipy=0.17 cython nose matplotlib scikit-learn=0.18.1"
@rth (Contributor)

The numpy==1.10.4 was actually intentional here, to test that PyKrige works with multiple numpy versions, not just the latest.

@basaks (Collaborator, Author) commented Dec 5, 2016

Why would you do that? I always use a virtualenv for Python, even on a supercomputer. Any particular reason why you would not upgrade from numpy 1.10.4?

The reason for the change in the numpy version is that scikit-learn=0.18.1 requires numpy 1.11+.

@basaks (Collaborator, Author) commented Dec 5, 2016

@rth Some good points there. Thanks.
Point 1 is very sensible.
Point 2: that is exactly what the Krige class is, isn't it?
Point 3: the wrapper Krige class is what makes the pykrige classes scikit-learn compatible.
Point 4: you are spot on. That pipeline.py is just an example of how to use the Krige class. We can rename it to something like you suggest.

@rth (Contributor)

The main reason to support (and test) multiple versions of dependencies is to reduce the chance of a dependency conflict (e.g. package A depends on package C-v1, package B depends on C-v2, and you need both A and B). I also use the latest numpy version, but in general we cannot assume that (e.g. in large legacy systems with a significant cost of upgrading). For instance, scikit-learn will install numpy 1.11 if it's not present, but it supports any numpy version starting from 1.6.1 (and also tests several versions in Travis CI). Here we just test the two latest numpy versions: 1.10 for Python < 3.5 and 1.11 for Python 3.5.

Point 2: I was referring to the new Krige class you created in this PR.

@basaks (Collaborator, Author) commented Dec 5, 2016

@rth There is no dependency conflict, as all the tests pass with the latest numpy version. Scikit-learn 0.18+ has many improvements and requires numpy 1.11+.

Even on a legacy system you can use a virtualenv. Has there been any problem with creating the pykrige virtualenv on a legacy system?

The Krige class is the convenience class that makes the pykrige OrdinaryKriging and UniversalKriging classes scikit-learn compatible.
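
A minimal sketch of such a wrapper, restricted to 2D ordinary kriging for brevity (the constructor parameters shown are illustrative, not necessarily the ones exposed by the Krige class in this PR):

    import numpy as np
    from sklearn.base import BaseEstimator, RegressorMixin

    from pykrige.ok import OrdinaryKriging


    class Krige(BaseEstimator, RegressorMixin):
        """scikit-learn compatible wrapper around OrdinaryKriging (2D)."""

        def __init__(self, variogram_model='linear', nlags=6):
            self.variogram_model = variogram_model
            self.nlags = nlags

        def fit(self, X, y):
            # X: (n_samples, 2) array of (x, y) coordinates; y: observed values
            self.model_ = OrdinaryKriging(
                X[:, 0], X[:, 1], y,
                variogram_model=self.variogram_model,
                nlags=self.nlags,
            )
            return self

        def predict(self, X):
            prediction, variance = self.model_.execute(
                'points', X[:, 0], X[:, 1])
            return np.asarray(prediction)

Because __init__ stores its parameters verbatim and the fitted state lives in an attribute with a trailing underscore, GridSearchCV and cross_val_score can clone and refit this estimator like any other scikit-learn regressor.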

PCKG_DAT = {'pykrige': ['README.md', 'CHANGELOG.md', 'LICENSE.txt', 'MANIFEST.in',
join('test_data', '*.txt'), join('test_data', '*.asc')]}
REQ = ['numpy', 'scipy', 'matplotlib']
REQ = ['numpy', 'scipy', 'matplotlib', 'sklearn']
@rth (Contributor)

sklearn shouldn't be added to mandatory requirements.

@basaks (Collaborator, Author) commented Dec 5, 2016

Fair enough, I can look into that.
I will open a new PR once I have managed to do this.

@rth (Contributor)

Thanks! As this is actually a lot of code, feel free to split this into several smaller PRs if you prefer. Thanks again for contributing :)

@@ -21,19 +21,20 @@
DESC = 'Kriging Toolkit for Python'
LDESC = 'PyKrige is a kriging toolkit for Python that supports two- and ' \
'three-dimensional ordinary and universal kriging.'
PACKAGES = ['pykrige']
PACKAGES = ['pykrige', 'pykrige.optimise']
@rth (Contributor)

It should be just one package pykrige.

@basaks (Collaborator, Author) commented Dec 5, 2016

Why would you put such a restriction on it? It just seems natural to add something like this in a subpackage, as not too many people will use it and it is not part of the core functionality.

@rth (Contributor)

Well, I was just wondering why we need this: since they are in the same setup.py, both will be installed at the same time anyway (and a single package, PyKrige, is installed when you run this version of setup.py). So if you don't add this, the result would be the same, wouldn't it? Only users who need pykrige.optimize (or rather pykrige.sklearn) would import it, but it can still be installed by default?

@rth (Contributor) commented Dec 5, 2016

P.S. @basaks, BTW, have you checked whether this new Kriging estimator passes check_estimator from sklearn.utils.estimator_checks (cf. the "Rolling your own estimator" docs)? Even if it doesn't, it's probably OK, as that test is not completely general (scikit-learn/scikit-learn#6715), but it could be useful for detecting API inconsistencies...
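
For reference, running that check is a one-liner; the import path below is hypothetical, since the final module name for the wrapper is still being discussed in this thread:

    from sklearn.utils.estimator_checks import check_estimator

    from pykrige.sklearn_compat import Krige  # hypothetical module path

    # Runs the scikit-learn API conformance checks and raises an
    # AssertionError describing the first check that fails.
    check_estimator(Krige)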

@basaks (Collaborator, Author) commented Dec 5, 2016

@rth No, I have not. However, the fact that GridSearchCV works with this class is some proof of that in itself.
I will run it through those further checks and address any issues in another PR. Thank you for pointing this out.

@rth (Contributor) commented Dec 6, 2016

> There is no dependency conflict as all the tests pass with the latest numpy version.
> Even on a legacy system you can use a virtualenv. Has there been any problem with creating the pykrige virtualenv on a legacy system?

virtualenv doesn't solve dependency issues. To give a more practical example, say I have a work project developed a year ago with scikit-learn 0.17.1 (at the time) and numpy 1.10 (in a virtualenv). It works fine, and now I want to add kriging as a new feature. But if we made PyKrige depend on scikit-learn 0.18.1 and numpy 1.11, I would be stuck: I would have to spend time upgrading my whole project to these versions, or testing by myself that PyKrige works with the previous versions even though they are not supported, or updating PyKrige to work with scikit-learn 0.17.1 and numpy 1.10. All of these are probably useful, but not what I wanted / was funded to do. This is the reason to reduce the dependencies to a strict minimum and to support multiple versions of them. It's the same reason why multiple Python versions are typically supported.

I actually have one such project (depending on numpy 1.10¹) and using PyKrige, so I'm -1 on this, though I would be happy to hear other opinions.

> Scikit-learn 0.18.+ [..] requires numpy 1.11.+.

Could you provide a URL confirming that?

¹Even if there might be almost no backward-incompatible changes between 1.10 and 1.11.

@basaks (Collaborator, Author) commented Dec 6, 2016

> Could you provide a URL confirming that?

Yes, it's not necessary. I checked the requirements for scikit-learn. It's just that pip pulls in the latest numpy by default when you install scikit-learn.

I will put together another PR soon :)

edit: it's not pip, it's the conda package manager that requires numpy=1.11+ with scikit-learn=0.18+. See details in the comment below.

@basaks (Collaborator, Author) commented Dec 6, 2016

I can build a virtualenv with scikit-learn 0.18.1 and numpy 1.10.4 on my PC, but it does not work in the Travis build due to the anaconda packaging:

$ python --version
Python 3.4.2
$ pip --version
pip 6.0.7 from /home/travis/virtualenv/python3.4.2/lib/python3.4/site-packages (python 3.4)
before_install.1
0.34s$ wget http://repo.continuum.io/miniconda/Miniconda${TRAVIS_PYTHON_VERSION:0:1}-latest-Linux-x86_64.sh -O miniconda.sh
--2016-12-06 11:22:13--  http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
Resolving repo.continuum.io (repo.continuum.io)... 104.16.19.10, 104.16.18.10, 2400:cb00:2048:1::6810:120a, ...
Connecting to repo.continuum.io (repo.continuum.io)|104.16.19.10|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh [following]
--2016-12-06 11:22:13--  https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
Connecting to repo.continuum.io (repo.continuum.io)|104.16.19.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33905474 (32M) [application/octet-stream]
Saving to: `miniconda.sh'
100%[======================================>] 33,905,474   143M/s   in 0.2s    
2016-12-06 11:22:13 (143 MB/s) - `miniconda.sh' saved [33905474/33905474]
before_install.2
0.01s$ chmod +x miniconda.sh
before_install.3
6.74s$ ./miniconda.sh -b
PREFIX=/home/travis/miniconda3
installing: python-3.5.2-0 ...
installing: conda-env-2.6.0-0 ...
installing: openssl-1.0.2j-0 ...
installing: pycosat-0.6.1-py35_1 ...
installing: readline-6.2-2 ...
installing: requests-2.11.1-py35_0 ...
installing: ruamel_yaml-0.11.14-py35_0 ...
installing: sqlite-3.13.0-0 ...
installing: tk-8.5.18-0 ...
installing: xz-5.2.2-0 ...
installing: yaml-0.1.6-0 ...
installing: zlib-1.2.8-3 ...
installing: conda-4.2.12-py35_0 ...
installing: pycrypto-2.6.1-py35_4 ...
installing: pip-8.1.2-py35_0 ...
installing: wheel-0.29.0-py35_0 ...
installing: setuptools-27.2.0-py35_0 ...
Python 3.5.2 :: Continuum Analytics, Inc.
creating default environment...
installation finished.
before_install.4
0.00s$ export PATH=/home/travis/miniconda${TRAVIS_PYTHON_VERSION:0:1}/bin:$PATH
before_install.5
2.47s$ conda update --yes conda
Fetching package metadata .......
Solving package specifications: ..........
Package plan for installation in environment /home/travis/miniconda3:
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    conda-4.2.13               |           py35_0         402 KB
The following packages will be UPDATED:
    conda: 4.2.12-py35_0 --> 4.2.13-py35_0
Fetching packages ...
conda-4.2.13-p 100% || Time: 0:00:00  25.61 MB/s
Extracting packages ...
[      COMPLETE      ]|| 100%
Unlinking packages ...
[      COMPLETE      ]|| 100%
Linking packages ...
[      COMPLETE      ]|| 100%
before_install.6
0.56s$ conda info -a
Current conda install:
               platform : linux-64
          conda version : 4.2.13
       conda is private : False
      conda-env version : 4.2.13
    conda-build version : not installed
         python version : 3.5.2.final.0
       requests version : 2.11.1
       root environment : /home/travis/miniconda3  (writable)
    default environment : /home/travis/miniconda3
       envs directories : /home/travis/miniconda3/envs
          package cache : /home/travis/miniconda3/pkgs
           channel URLs : https://repo.continuum.io/pkgs/free/linux-64
                          https://repo.continuum.io/pkgs/free/noarch
                          https://repo.continuum.io/pkgs/pro/linux-64
                          https://repo.continuum.io/pkgs/pro/noarch
            config file : None
           offline mode : False
# conda environments:
#
root                  *  /home/travis/miniconda3
sys.version: 3.5.2 |Continuum Analytics, Inc.| (defau...
sys.prefix: /home/travis/miniconda3
sys.executable: /home/travis/miniconda3/bin/python
conda location: /home/travis/miniconda3/lib/python3.5/site-packages/conda
conda-build: None
conda-env: /home/travis/miniconda3/bin/conda-env
user site dirs: 
CIO_TEST: <not set>
CONDA_DEFAULT_ENV: <not set>
CONDA_ENVS_PATH: <not set>
LD_LIBRARY_PATH: <not set>
PATH: /home/travis/miniconda3/bin:/home/travis/virtualenv/python3.4.2/bin:/home/travis/bin:/home/travis/.local/bin:/home/travis/.rvm/gems/ruby-1.9.3-p551/bin:/home/travis/.rvm/gems/ruby-1.9.3-p551@global/bin:/home/travis/.rvm/rubies/ruby-1.9.3-p551/bin:/opt/python/2.7.9/bin:/opt/python/2.6.9/bin:/opt/python/3.4.2/bin:/opt/python/3.3.5/bin:/opt/python/3.2.5/bin:/opt/python/pypy-2.5.0/bin:/opt/python/pypy3-2.4.0/bin:/usr/local/phantomjs/bin:/home/travis/.nvm/v0.10.36/bin:./node_modules/.bin:/usr/local/maven-3.2.5/bin:/usr/local/clang-3.4/bin:/home/travis/.gimme/versions/go1.4.1.linux.amd64/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/travis/.rvm/bin
PYTHONHOME: <not set>
PYTHONPATH: <not set>
WARNING: could not import _license.show_info
# try:
# $ conda install -n root _license
2.19s$ conda install --yes $DEPS pip
Fetching package metadata .......
Solving package specifications: ....
UnsatisfiableError: The following specifications were found to be in conflict:
  - numpy 1.10.4*
  - scikit-learn 0.18.1* -> numpy 1.11*
Use "conda info <package>" to see the dependencies for each package.

@rth (Contributor) commented Dec 6, 2016

Yes, it looks like conda only builds scikit-learn 0.18.1 against numpy 1.11 (probably because they already have to build [3 Python versions] x [MKL/no-MKL builds] and don't want to add another x [number of numpy versions]).

% conda search scikit-learn                                                                                      
Fetching package metadata: ....
scikit-learn
                            [...]
                             0.18                np111py27_0  defaults        
                             0.18            np111py27_nomkl_0  defaults        [nomkl]
                             0.18                np111py34_0  defaults        
                             0.18            np111py34_nomkl_0  defaults        [nomkl]
                             0.18                np111py35_0  defaults        
                             0.18            np111py35_nomkl_0  defaults        [nomkl]
                             0.18.1              np111py27_0  defaults        
                             0.18.1          np111py27_nomkl_0  defaults        [nomkl]
                             0.18.1              np111py34_0  defaults        
                             0.18.1          np111py34_nomkl_0  defaults        [nomkl]
                          .  0.18.1              np111py35_0  defaults        
                             0.18.1          np111py35_nomkl_0  defaults        [nomkl]

Maybe what we could do for now in Travis is to add scikit-learn 0.18.1 to the Python 3.5 line (which has numpy 1.11) and just not install scikit-learn for the other Python/numpy versions, then skip the tests that need sklearn by raising a unittest.SkipTest (see the sketch below)?
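
A sketch of what such a guard could look like in the test module (the test class name is illustrative); unittest treats a SkipTest raised in setUp as a skipped test, and nose should behave the same way:

    import unittest

    try:
        from sklearn.model_selection import GridSearchCV  # noqa: F401
        SKLEARN_INSTALLED = True
    except ImportError:
        SKLEARN_INSTALLED = False


    class TestKrigeCV(unittest.TestCase):

        def setUp(self):
            if not SKLEARN_INSTALLED:
                raise unittest.SkipTest('scikit-learn not installed')

        def test_grid_search_runs(self):
            # The actual GridSearchCV-based assertions would go here.
            pass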

@basaks (Collaborator, Author) commented Dec 6, 2016

@rth I tried to get numpy=1.9.2 working as well with Python 2.7, but there does not seem to be any stable conda environment that works with both numpy=1.9.2 and scipy=0.17.

Anyway, all your original deps are working in both the Python 3 and Python 2 environments.

I have addressed all your concerns with the original PR.

@basaks (Collaborator, Author) commented Dec 9, 2016

@rth Let me know if I have missed anything. I did not open another PR, as the scope of this PR is still the same; i.e., I did not break it up into several PRs, and I have hopefully managed to address all your concerns.

@rth (Contributor) commented Dec 10, 2016

@basaks Sorry for the late response. Yes, conda can be frustrating sometimes. Just a few last comments,

  • Could you please move the /examples folder one level up (so it's in the top-level directory, as for instance in https://github.com/scikit-learn/scikit-learn)?
  • I tried to simplify the example as much as possible (by removing configparser and the pipeline, using the default parameters for GridSearchCV, and reducing how many lines it prints) here: http://pastebin.com/jWvHKi1q Would you be OK with that?
  • Would you mind renaming optimise.py to sklearn.py (or something similar)? The functions this module includes do not perform any optimization; they just expose a scikit-learn compatible API.
  • Also, the ConfigException, TagsMixin and KrigePredictProbaMixin classes are not actually used in the example (or in the unit tests), so if the Krige class is defined as class Krige(RegressorMixin, BaseEstimator) everything still works. I understand that you are using them in your own code; maybe it would be best to include them in the next PR (not this one). Both would require more discussion, as they are not standard with respect to the scikit-learn API (predict_proba typically returns a single array, not 4), and estimator tags are still being developed for scikit-learn 0.19 (cf. issue 6599 at https://github.com/scikit-learn/scikit-learn/issues).
  • Regarding the new section in the README:
    • optimise -> optimize everywhere: sorry, I know you are in Australia, but the scientific Python community (and this package) uses American English (e.g. scipy.optimize).

    • Could we merge the two current sections "Kriging Optimiser" and "How to use the optimise module" into just one called "Kriging parameters tuning" (or something similar)? In general, I would remove all references to the "optimization module" (it's debatable whether finding the best parameters by cross-validation can be called optimization, and in any case this PR doesn't add any optimization capability, just a new API) and say something along the lines of the following (see also the sketch after this list):

      PyKrige also exposes a scikit-learn compatible API, which can be used to perform parameter tuning with sklearn.model_selection.GridSearchCV (with a link to the scikit-learn docs). [Maybe some explanation of the parameters that can be tuned.] You can run the corresponding example with
      python examples/krige_cv.py

      Maybe also remove the table at the end of the README (among other things, the mean_test_score is the R² score by default, so when it's negative it means that the predictions are pretty bad, which is OK since we use random data, but probably not something you would want to have in a README).
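
A rough sketch of what the trimmed-down examples/krige_cv.py could reduce to, under the assumptions above (the pykrige.sklearn_compat import path and the exact parameter grid are illustrative):

    import numpy as np
    from sklearn.model_selection import GridSearchCV

    from pykrige.sklearn_compat import Krige  # hypothetical module path

    # Random demo data: 2D coordinates and a target variable.
    rng = np.random.RandomState(42)
    X = rng.uniform(0.0, 100.0, size=(100, 2))
    y = rng.normal(size=100)

    param_grid = {
        'variogram_model': ['linear', 'power', 'gaussian', 'spherical'],
        'nlags': [4, 6, 8],
    }

    # Default 3-fold cross-validation, scored with the estimator's R^2.
    search = GridSearchCV(Krige(), param_grid)
    search.fit(X, y)

    print('Best parameters:', search.best_params_)
    print('Best CV score (R^2):', search.best_score_)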

Thanks again for this PR. What do you think?

@bsmurphy Would you have time to have a look at this PR, to know if you are OK with it? Thanks!

@basaks (Collaborator, Author) commented Dec 10, 2016

@rth Excellent suggestions. I agree with all of them.
My judgement on this PR is a bit biased, as I am using these classes in a specialized pipeline.
Thank you for your valuable input.

I have made all the changes you suggest, except that I could not rename optimiser.py to sklearn.py: with that name, import sklearn may end up importing this file instead of scikit-learn, depending on the Python path.

@rth (Contributor) commented Dec 10, 2016

Thanks, @basaks! This looks good to me.

I will just wait a few more days before merging, in case bsmurphy (or anybody else interested) wants to have a look at this PR.

@rth merged commit 07af4a5 into GeoStat-Framework:master on Dec 15, 2016.
@rth mentioned this pull request on Dec 20, 2016.