recommenders-team · miguelgfierro · Jul 28, 2020 · Jul 11, 2020 · Jul 15, 2020 · Jul 20, 2020
@@ -0,0 +1,324 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Geometry Aware Inductive Matrix Completion (GeoIMC)\n",
+    "\n",
+    "GeoIMC is an inductive matrix completion algorithm based on the works by Jawanpuria et al. (2019)\n",
+    "\n",
+    "Consider the case of MovieLens-100K (ML100K), Let $X \\in R^{m \\times d_1}, Z \\in R^{n \\times d_2} $ be the features of users and movies respectively. Let $M \\in R^{m \\times n}$, be the partially observed ratings matrix. GeoIMC models this matrix as $M = XUBV^TZ^T$, where $U \\in R^{d_1 \\times k}, V \\in R^{d_2 \\times k}, B \\in R^{k \\times k}$ are Orthogonal, Orthogonal, Symmetric Positive-Definite matrices respectively. This Optimization problem is solved by using Pymanopt.\n",
+    "\n",
+    "\n",
+    "This notebook provides an example of how to utilize and evaluate GeoIMC implementation in **reco_utils**\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys\n",
+    "import tempfile\n",
+    "import zipfile\n",
+    "import pandas as pd\n",
+    "sys.path.append(\"../../\")\n",
+    "sys.path.append(\"../../reco_utils/recommender/geoimc/\")\n",
+    "\n",
+    "from reco_utils.dataset import movielens\n",
+    "from reco_utils.recommender.geoimc.GeoIMCdata import ML_100K\n",
+    "from reco_utils.recommender.geoimc.GeoIMCalgorithm import IMCProblem\n",
+    "from reco_utils.recommender.geoimc.GeoIMCpredict import Inferer\n",
+    "from reco_utils.evaluation.python_evaluation import (\n",
+    "    rmse, mae\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Choose the MovieLens dataset\n",
+    "MOVIELENS_DATA_SIZE = '100k'\n",
+    "# Normalize user, item features\n",
+    "normalize = True\n",
+    "# Rank (k) of the model\n",
+    "rank = 300\n",
+    "# Regularization parameter\n",
+    "regularizer = 1e-3\n",
+    "\n",
+    "# Parameters for algorithm convergence\n",
+    "max_iters = 150000\n",
+    "max_time = 1000\n",
+    "verbosity = 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Download ML100K dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 4.81k/4.81k [00:01<00:00, 4.77kKB/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Create a directory to download ML100K\n",
+    "dp = tempfile.mkdtemp(suffix='-geoimc')\n",
+    "movielens.download_movielens(MOVIELENS_DATA_SIZE, f\"{dp}/ml-100k.zip\")\n",
+    "with zipfile.ZipFile(f\"{dp}/ml-100k.zip\", 'r') as z:\n",
+    "    z.extractall(dp)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load the dataset using the example features provided in helpers\n",
+    "\n",
+    "The features were generated using the same method as the work by Xin Dong et al. (2017)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = ML_100K(\n",
+    "    normalize=normalize,\n",
+    "    target_transform='binarize'\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset.load_data(\n",
+    "    f\"{dp}/ml-100k/\", \n",
+    "    f\"../../reco_utils/recommender/geoimc/ml100k-features/user-features.smat\",\n",
+    "    f\"../../reco_utils/recommender/geoimc/ml100k-features/item-features.smat\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Characteristics:\n",
+      "\n",
+      "              target: (943, 1682)\n",
+      "              entities: (943, 1822), (1682, 1923)\n",
+      "\n",
+      "              training: (80000,)\n",
+      "              training_entities: (943, 1822), (1682, 1923)\n",
+      "\n",
+      "              testing: (20000,)\n",
+      "              test_entities: (943, 1822), (1682, 1923)\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"\"\"Characteristics:\n",
+    "\n",
+    "              target: {dataset.training_data.data.shape}\n",
+    "              entities: {dataset.entities[0].shape}, {dataset.entities[1].shape}\n",
+    "\n",
+    "              training: {dataset.training_data.get_data().data.shape}\n",
+    "              training_entities: {dataset.training_data.get_entity(\"row\").shape}, {dataset.training_data.get_entity(\"col\").shape}\n",
+    "\n",
+    "              testing: {dataset.test_data.get_data().data.shape}\n",
+    "              test_entities: {dataset.test_data.get_entity(\"row\").shape}, {dataset.test_data.get_entity(\"col\").shape}\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Initialize the IMC problem"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prblm = IMCProblem(\n",
+    "    dataset.training_data,\n",
+    "    lambda1=regularizer,\n",
+    "    rank=rank\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Optimizing...\n",
+      "Terminated - max time reached after 1742 iterations.\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Solve the Optimization problem\n",
+    "prblm.solve(\n",
+    "    max_time,\n",
+    "    max_iters,\n",
+    "    verbosity\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initialize an inferer\n",
+    "inferer = Inferer(\n",
+    "    method='dot'\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Predict using the parametrized matrices\n",
+    "predictions = inferer.infer(\n",
+    "    dataset.test_data,\n",
+    "    prblm.W\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Prepare the test, predicted dataframes\n",
+    "user_ids = dataset.test_data.get_data().tocoo().row\n",
+    "item_ids = dataset.test_data.get_data().tocoo().col\n",
+    "test_df = pd.DataFrame(\n",
+    "    data={\n",
+    "        \"userID\": user_ids,\n",
+    "        \"itemID\": item_ids,\n",
+    "        \"rating\": dataset.test_data.get_data().data\n",
+    "    }\n",
+    ")\n",
+    "predictions_df = pd.DataFrame(\n",
+    "    data={\n",
+    "        \"userID\": user_ids,\n",
+    "        \"itemID\": item_ids,\n",
+    "        \"prediction\": [predictions[uid, iid] for uid, iid in list(zip(user_ids, item_ids))]\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\n",
+      "RMSE: 0.41945523047207506\n",
+      "\n",
+      "\n",
+      "MAE: 0.353115760378552\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Calculate RMSE\n",
+    "print(f\"\"\"\n",
+    "RMSE: {\n",
+    "    rmse(\n",
+    "        test_df,\n",
+    "        predictions_df\n",
+    ")}\n",
+    "\"\"\")\n",
+    "# Calculate MAE\n",
+    "print(f\"\"\"\n",
+    "MAE: {\n",
+    "    mae(\n",
+    "        test_df,\n",
+    "        predictions_df\n",
+    ")}\n",
+    "\"\"\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## References\n",
+    "\n",
+    "[1] Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra. _[Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach](https://www.mitpressjournals.org/doi/full/10.1162/tacl_a_00257)_. Transaction of the Association for Computational Linguistics (TACL), Volume 7, p.107-120, 2019.\n",
+    "\n",
+    "[2] Xin Dong, Lei Yu, Zhonghuo Wu, Yuxia Sun, Lingfeng Yuan, Fangxi Zhang. [A Hybrid Collaborative Filtering Model withDeep Structure for Recommender Systems](https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14676/13916).\n",
+    "Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p.1309-1315, 2017."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python (reco)",
+   "language": "python",
+   "name": "reco_base"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}