x-datascience-datacamp · clmrie · Dec 16, 2024 · Jan 2, 2025
diff --git a/numpy_questions.py b/numpy_questions.py
@@ -1,19 +1,51 @@
-"""Assignment - using numpy and making a PR.
-
-The goals of this assignment are:
-    * Use numpy in practice with two easy exercises.
-    * Use automated tools to validate the code (`pytest` and `flake8`)
-    * Submit a Pull-Request on github to practice `git`.
-
-The two functions below are skeleton functions. The docstrings explain what
-are the inputs, the outputs and the expected error. Fill the function to
-complete the assignment. The code should be able to pass the test that we
-wrote. To run the tests, use `pytest test_numpy_question.py` at the root of
-the repo. It should say that 2 tests ran with success.
-
-We also ask to respect the pep8 convention: https://pep8.org.
-This will be enforced with `flake8`. You can check that there is no flake8
-errors by calling `flake8` at the root of the repo.
+"""Assignment - making a sklearn estimator and cv splitter.
+
+The goal of this assignment is to implement by yourself:
+
+- a scikit-learn estimator for the KNearestNeighbors for classification
+  tasks and check that it is working properly.
+- a scikit-learn CV splitter where the splits are based on a Pandas
+  DatetimeIndex.
+
+Detailed instructions for question 1:
+The nearest neighbor classifier predicts for a point X_i the target y_k of
+the training sample X_k which is the closest to X_i. We measure proximity with
+the Euclidean distance. The model will be evaluated with the accuracy (the
+fraction of samples correctly classified). You need to implement the `fit`,
+`predict` and `score` methods for this class. The code you write should pass
+the tests we implemented, which you can run from the root of the repo with
+`pytest test_sklearn_questions.py`. Note that to be fully valid, a
+scikit-learn estimator needs to check that the inputs given to `fit` and
+`predict` are correct using the `check_*` functions imported in the file.
+You can find more information on how they should be used in the following doc:
+https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator.
+Make sure to use them to pass `test_nearest_neighbor_check_estimator`.
+
+
+Detailed instructions for question 2:
+The data to split should have its index or one of its columns in
+datetime format. The aim is to split the data into train and test
+sets so that, for each pair of successive months, we learn on the first and
+predict on the following one. For example, if you have data distributed from
+November 2020 to March 2021, you have 4 splits. The first split
+lets you learn on November data and predict on December data, the
+second split lets you learn on December and predict on January, etc.
+
+We also ask you to respect the pep8 convention: https://pep8.org. This will be
+enforced with `flake8`. You can check that there are no flake8 errors by
+calling `flake8` at the root of the repo.
+
+Finally, you need to write docstrings for the methods you code and for the
+class. The docstrings will be checked using `pydocstyle`, which you can also
+call at the root of the repo.
+
+Hints
+-----
+- You can use the function:
+
+from sklearn.metrics.pairwise import pairwise_distances
+
+to compute distances between 2 sets of samples.
 """
 import numpy as np
 
@@ -29,20 +61,21 @@ def max_index(X):
     Returns
     -------
     (i, j) : tuple(int)
-        The row and columnd index of the maximum.
+        The row and column index of the maximum.
 
     Raises
     ------
     ValueError
         If the input is not a numpy array or
         if the shape is not 2D.
     """
-    i = 0
-    j = 0
-
-    # TODO
-
-    return i, j
+    if not isinstance(X, np.ndarray):
+        raise ValueError("Input must be a numpy array.")
+    if X.ndim != 2:
+        raise ValueError("Input must be a 2D numpy array.")
+    # Convert the flat argmax index into (row, column) coordinates
+    max_pos = np.unravel_index(np.argmax(X), X.shape)
+    return max_pos
 
 
 def wallis_product(n_terms):
@@ -57,11 +90,17 @@ def wallis_product(n_terms):
         Number of steps in the Wallis product. Note that `n_terms=0` will
         consider the product to be `1`.
 
+
     Returns
     -------
     pi : float
         The approximation of order `n_terms` of pi using the Wallis product.
     """
-    # XXX : The n_terms is an int that corresponds to the number of
-    # terms in the product. For example 10000.
-    return 0.
+    if n_terms == 0:
+        return 2.0  # empty product is 1, so the approximation is 2
+
+    product = 1.0
+    for n in range(1, n_terms + 1):
+        term = (4 * n**2) / (4 * n**2 - 1)
+        product *= term
+    return 2 * product
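
Review note: the loop in `wallis_product` multiplies the factors 4n^2 / (4n^2 - 1) for n = 1..n_terms and doubles the result. As a cross-check, here is a minimal vectorized sketch of the same product. The name `wallis_product_vectorized` is hypothetical and not part of this PR; it assumes `n_terms` is a non-negative int and is only meant to compare against the loop, not to replace it.

```python
import math

import numpy as np


def wallis_product_vectorized(n_terms):
    """Approximate pi with the Wallis product (hypothetical helper)."""
    # Factors 4n^2 / (4n^2 - 1) for n = 1..n_terms; empty when n_terms == 0.
    n = np.arange(1, n_terms + 1, dtype=np.float64)
    factors = 4 * n**2 / (4 * n**2 - 1)
    # np.prod of an empty array is 1.0, so n_terms == 0 yields 2.0
    # without a dedicated special case.
    return 2.0 * float(np.prod(factors))


if __name__ == "__main__":
    # Print the approximation and its absolute error for a few sizes.
    for n_terms in (0, 1, 10, 100_000):
        approx = wallis_product_vectorized(n_terms)
        print(n_terms, approx, abs(math.pi - approx))
```

Because the empty product is 1, the `n_terms == 0` case falls out of the general formula, which suggests the explicit early return in the diff is optional. The Wallis product converges slowly (the error shrinks roughly in proportion to 1/n_terms), so a large `n_terms` is needed for a tight match with `math.pi`.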