Poincare training bug #1917

Closed
menshikh-iv opened this issue Feb 20, 2018 · 0 comments · Fixed by #1959
menshikh-iv commented Feb 20, 2018

Description

I trained a Poincare model on a wiki graph and received the exception below.

Steps/Code/Corpus to Reproduce

I have no minimal example that reproduces this, but here is exactly what I did:

from gensim.models.poincare import PoincareModel
import logging
import json
from tqdm import tqdm
from smart_open import smart_open

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')

class WikiGraphReader(object):
    def __init__(self, pth):
        self.pth = pth
        
    def __iter__(self):
        with smart_open(self.pth, 'r') as infile:
            for row in tqdm(infile):
                row = json.loads(row)
                src = row["s"]

                for dst in row["d"]:
                    yield (src, dst)
         
corpus = WikiGraphReader("edges.jsonl.gz")
model = PoincareModel(corpus)
model.train(epochs=1, batch_size=1000)  # all fine, trained successfully
model.save("poincare-1ep-wiki.model")
model.train(epochs=1, batch_size=1000)  # the exception is raised by this call
model.save("p_model/poincare-1.5ep-wiki.model")  # I saved this model too

Full log and stack trace from the second model.train(epochs=1, batch_size=1000) call:

2018-02-05 10:49:58,008 - training model of size 50 with 1 workers on 128138847 relations for 1 epochs and 10 burn-in epochs, using lr=0.01000 burn-in lr=0.01000 negative=10
2018-02-05 10:49:58,010 - Starting burn-in (10 epochs)----------------------------------------
2018-02-05 10:56:51,400 - Training on epoch 1, examples #999000-#1000000, loss: 2188.85
2018-02-05 10:56:51,404 - Time taken for 1000000 examples: 329.60 s, 3033.98 examples / s
2018-02-05 11:01:44,625 - Training on epoch 1, examples #1999000-#2000000, loss: 2187.71
2018-02-05 11:01:44,627 - Time taken for 1000000 examples: 293.22 s, 3410.41 examples / s
2018-02-05 11:06:38,729 - Training on epoch 1, examples #2999000-#3000000, loss: 2186.41
2018-02-05 11:06:38,731 - Time taken for 1000000 examples: 294.10 s, 3400.18 examples / s
2018-02-05 11:11:28,291 - Training on epoch 1, examples #3999000-#4000000, loss: 2185.42
2018-02-05 11:11:28,293 - Time taken for 1000000 examples: 289.56 s, 3453.52 examples / s
2018-02-05 11:16:16,831 - Training on epoch 1, examples #4999000-#5000000, loss: 2184.04
2018-02-05 11:16:16,833 - Time taken for 1000000 examples: 288.54 s, 3465.75 examples / s
2018-02-05 11:21:06,625 - Training on epoch 1, examples #5999000-#6000000, loss: 2182.88
2018-02-05 11:21:06,630 - Time taken for 1000000 examples: 289.79 s, 3450.75 examples / s
2018-02-05 11:26:55,483 - Training on epoch 1, examples #6999000-#7000000, loss: 2181.47
2018-02-05 11:26:55,484 - Time taken for 1000000 examples: 348.85 s, 2866.54 examples / s
2018-02-05 11:31:45,830 - Training on epoch 1, examples #7999000-#8000000, loss: 2180.34
2018-02-05 11:31:45,839 - Time taken for 1000000 examples: 290.34 s, 3444.18 examples / s
2018-02-05 11:36:30,690 - Training on epoch 1, examples #8999000-#9000000, loss: 2179.56
2018-02-05 11:36:30,692 - Time taken for 1000000 examples: 284.85 s, 3510.62 examples / s
2018-02-05 11:41:15,313 - Training on epoch 1, examples #9999000-#10000000, loss: 2178.03
2018-02-05 11:41:15,315 - Time taken for 1000000 examples: 284.62 s, 3513.45 examples / s
2018-02-05 11:46:00,357 - Training on epoch 1, examples #10999000-#11000000, loss: 2177.52
2018-02-05 11:46:00,358 - Time taken for 1000000 examples: 285.04 s, 3508.26 examples / s
2018-02-05 11:50:48,905 - Training on epoch 1, examples #11999000-#12000000, loss: 2175.87
2018-02-05 11:50:48,910 - Time taken for 1000000 examples: 288.55 s, 3465.64 examples / s
2018-02-05 11:55:35,918 - Training on epoch 1, examples #12999000-#13000000, loss: 2174.76
2018-02-05 11:55:35,919 - Time taken for 1000000 examples: 287.01 s, 3484.23 examples / s
2018-02-05 12:00:24,240 - Training on epoch 1, examples #13999000-#14000000, loss: 2173.49
2018-02-05 12:00:24,242 - Time taken for 1000000 examples: 288.32 s, 3468.36 examples / s
2018-02-05 12:05:07,573 - Training on epoch 1, examples #14999000-#15000000, loss: 2172.35
2018-02-05 12:05:07,574 - Time taken for 1000000 examples: 283.33 s, 3529.45 examples / s
2018-02-05 12:09:52,164 - Training on epoch 1, examples #15999000-#16000000, loss: 2171.20
2018-02-05 12:09:52,165 - Time taken for 1000000 examples: 284.59 s, 3513.83 examples / s
2018-02-05 12:14:41,436 - Training on epoch 1, examples #16999000-#17000000, loss: 2170.33
2018-02-05 12:14:41,438 - Time taken for 1000000 examples: 289.27 s, 3456.97 examples / s
2018-02-05 12:19:34,138 - Training on epoch 1, examples #17999000-#18000000, loss: 2169.56
2018-02-05 12:19:34,142 - Time taken for 1000000 examples: 292.70 s, 3416.47 examples / s
2018-02-05 12:24:27,812 - Training on epoch 1, examples #18999000-#19000000, loss: 2168.17
2018-02-05 12:24:27,814 - Time taken for 1000000 examples: 293.67 s, 3405.19 examples / s
2018-02-05 12:29:15,083 - Training on epoch 1, examples #19999000-#20000000, loss: 2167.16
2018-02-05 12:29:15,085 - Time taken for 1000000 examples: 287.27 s, 3481.06 examples / s
2018-02-05 12:34:03,589 - Training on epoch 1, examples #20999000-#21000000, loss: 2165.85
2018-02-05 12:34:03,590 - Time taken for 1000000 examples: 288.50 s, 3466.17 examples / s
2018-02-05 12:38:50,770 - Training on epoch 1, examples #21999000-#22000000, loss: 2164.89
2018-02-05 12:38:50,772 - Time taken for 1000000 examples: 287.18 s, 3482.14 examples / s
2018-02-05 12:43:41,125 - Training on epoch 1, examples #22999000-#23000000, loss: 2163.63
2018-02-05 12:43:41,129 - Time taken for 1000000 examples: 290.35 s, 3444.09 examples / s
2018-02-05 12:48:27,127 - Training on epoch 1, examples #23999000-#24000000, loss: 2162.46
2018-02-05 12:48:27,129 - Time taken for 1000000 examples: 286.00 s, 3496.53 examples / s
2018-02-05 12:53:17,683 - Training on epoch 1, examples #24999000-#25000000, loss: 2161.23
2018-02-05 12:53:17,684 - Time taken for 1000000 examples: 290.55 s, 3441.71 examples / s
2018-02-05 12:58:02,880 - Training on epoch 1, examples #25999000-#26000000, loss: 2160.17
2018-02-05 12:58:02,881 - Time taken for 1000000 examples: 285.20 s, 3506.37 examples / s
2018-02-05 13:02:47,177 - Training on epoch 1, examples #26999000-#27000000, loss: 2158.66
2018-02-05 13:02:47,179 - Time taken for 1000000 examples: 284.30 s, 3517.46 examples / s
2018-02-05 13:07:31,441 - Training on epoch 1, examples #27999000-#28000000, loss: 2157.93
2018-02-05 13:07:31,442 - Time taken for 1000000 examples: 284.26 s, 3517.89 examples / s
2018-02-05 13:12:20,000 - Training on epoch 1, examples #28999000-#29000000, loss: 2156.97
2018-02-05 13:12:20,004 - Time taken for 1000000 examples: 288.56 s, 3465.52 examples / s
2018-02-05 13:17:06,050 - Training on epoch 1, examples #29999000-#30000000, loss: 2155.66
2018-02-05 13:17:06,051 - Time taken for 1000000 examples: 286.04 s, 3495.96 examples / s
2018-02-05 13:21:56,627 - Training on epoch 1, examples #30999000-#31000000, loss: 2154.42
2018-02-05 13:21:56,628 - Time taken for 1000000 examples: 290.58 s, 3441.45 examples / s
2018-02-05 13:26:41,004 - Training on epoch 1, examples #31999000-#32000000, loss: 2153.39
2018-02-05 13:26:41,005 - Time taken for 1000000 examples: 284.37 s, 3516.49 examples / s
2018-02-05 13:31:26,601 - Training on epoch 1, examples #32999000-#33000000, loss: 2152.29
2018-02-05 13:31:26,603 - Time taken for 1000000 examples: 285.59 s, 3501.49 examples / s
2018-02-05 13:36:11,844 - Training on epoch 1, examples #33999000-#34000000, loss: 2151.36
2018-02-05 13:36:11,845 - Time taken for 1000000 examples: 285.24 s, 3505.82 examples / s
2018-02-05 13:41:08,003 - Training on epoch 1, examples #34999000-#35000000, loss: 2150.06
2018-02-05 13:41:08,008 - Time taken for 1000000 examples: 296.16 s, 3376.58 examples / s
2018-02-05 13:45:59,593 - Training on epoch 1, examples #35999000-#36000000, loss: 2149.02
2018-02-05 13:45:59,594 - Time taken for 1000000 examples: 291.58 s, 3429.54 examples / s
2018-02-05 13:50:52,455 - Training on epoch 1, examples #36999000-#37000000, loss: 2148.05
2018-02-05 13:50:52,457 - Time taken for 1000000 examples: 292.86 s, 3414.59 examples / s
2018-02-05 13:55:42,711 - Training on epoch 1, examples #37999000-#38000000, loss: 2146.37
2018-02-05 13:55:42,712 - Time taken for 1000000 examples: 290.25 s, 3445.26 examples / s
2018-02-05 14:00:31,112 - Training on epoch 1, examples #38999000-#39000000, loss: 2145.71
2018-02-05 14:00:31,113 - Time taken for 1000000 examples: 288.40 s, 3467.42 examples / s
2018-02-05 14:05:18,087 - Training on epoch 1, examples #39999000-#40000000, loss: 2144.32
2018-02-05 14:05:18,088 - Time taken for 1000000 examples: 286.97 s, 3484.65 examples / s
2018-02-05 14:10:08,383 - Training on epoch 1, examples #40999000-#41000000, loss: 2143.63
2018-02-05 14:10:08,388 - Time taken for 1000000 examples: 290.29 s, 3444.78 examples / s
2018-02-05 14:15:01,954 - Training on epoch 1, examples #41999000-#42000000, loss: 2142.36
2018-02-05 14:15:01,955 - Time taken for 1000000 examples: 293.57 s, 3406.40 examples / s
2018-02-05 14:19:58,021 - Training on epoch 1, examples #42999000-#43000000, loss: 2141.21
2018-02-05 14:19:58,023 - Time taken for 1000000 examples: 296.07 s, 3377.63 examples / s
2018-02-05 14:24:43,944 - Training on epoch 1, examples #43999000-#44000000, loss: 2140.30
2018-02-05 14:24:43,945 - Time taken for 1000000 examples: 285.92 s, 3497.48 examples / s
2018-02-05 14:29:36,938 - Training on epoch 1, examples #44999000-#45000000, loss: 2138.98
2018-02-05 14:29:36,939 - Time taken for 1000000 examples: 292.99 s, 3413.06 examples / s
2018-02-05 14:34:31,522 - Training on epoch 1, examples #45999000-#46000000, loss: 2137.78
2018-02-05 14:34:31,523 - Time taken for 1000000 examples: 294.58 s, 3394.64 examples / s
2018-02-05 14:39:24,775 - Training on epoch 1, examples #46999000-#47000000, loss: 2136.79
2018-02-05 14:39:24,780 - Time taken for 1000000 examples: 293.25 s, 3410.04 examples / s
2018-02-05 14:44:15,172 - Training on epoch 1, examples #47999000-#48000000, loss: 2135.49
2018-02-05 14:44:15,174 - Time taken for 1000000 examples: 290.39 s, 3443.62 examples / s
2018-02-05 14:49:07,628 - Training on epoch 1, examples #48999000-#49000000, loss: 2135.08
2018-02-05 14:49:07,630 - Time taken for 1000000 examples: 292.45 s, 3419.34 examples / s
2018-02-05 14:53:51,284 - Training on epoch 1, examples #49999000-#50000000, loss: 2133.45
2018-02-05 14:53:51,285 - Time taken for 1000000 examples: 283.65 s, 3525.43 examples / s
2018-02-05 14:58:39,403 - Training on epoch 1, examples #50999000-#51000000, loss: 2132.59
2018-02-05 14:58:39,404 - Time taken for 1000000 examples: 288.12 s, 3470.81 examples / s
2018-02-05 15:03:27,455 - Training on epoch 1, examples #51999000-#52000000, loss: 2131.60
2018-02-05 15:03:27,456 - Time taken for 1000000 examples: 288.05 s, 3471.65 examples / s
2018-02-05 15:08:19,622 - Training on epoch 1, examples #52999000-#53000000, loss: 2130.17
2018-02-05 15:08:19,627 - Time taken for 1000000 examples: 292.17 s, 3422.71 examples / s
2018-02-05 15:13:12,975 - Training on epoch 1, examples #53999000-#54000000, loss: 2129.34
2018-02-05 15:13:12,976 - Time taken for 1000000 examples: 293.35 s, 3408.92 examples / s
2018-02-05 15:18:01,815 - Training on epoch 1, examples #54999000-#55000000, loss: 2128.32
2018-02-05 15:18:01,816 - Time taken for 1000000 examples: 288.84 s, 3462.15 examples / s
2018-02-05 15:22:45,226 - Training on epoch 1, examples #55999000-#56000000, loss: 2126.67
2018-02-05 15:22:45,227 - Time taken for 1000000 examples: 283.41 s, 3528.47 examples / s
2018-02-05 15:27:31,026 - Training on epoch 1, examples #56999000-#57000000, loss: 2126.11
2018-02-05 15:27:31,027 - Time taken for 1000000 examples: 285.79 s, 3499.01 examples / s
2018-02-05 15:32:19,805 - Training on epoch 1, examples #57999000-#58000000, loss: 2125.11
2018-02-05 15:32:19,807 - Time taken for 1000000 examples: 288.77 s, 3462.90 examples / s
2018-02-05 15:37:11,024 - Training on epoch 1, examples #58999000-#59000000, loss: 2123.99
2018-02-05 15:37:11,028 - Time taken for 1000000 examples: 291.22 s, 3433.87 examples / s
2018-02-05 15:42:06,631 - Training on epoch 1, examples #59999000-#60000000, loss: 2123.01
2018-02-05 15:42:06,632 - Time taken for 1000000 examples: 295.60 s, 3382.92 examples / s
2018-02-05 15:46:54,707 - Training on epoch 1, examples #60999000-#61000000, loss: 2121.46
2018-02-05 15:46:54,709 - Time taken for 1000000 examples: 288.07 s, 3471.33 examples / s
2018-02-05 15:51:42,019 - Training on epoch 1, examples #61999000-#62000000, loss: 2120.72
2018-02-05 15:51:42,021 - Time taken for 1000000 examples: 287.31 s, 3480.57 examples / s
2018-02-05 15:56:29,973 - Training on epoch 1, examples #62999000-#63000000, loss: 2119.82
2018-02-05 15:56:29,974 - Time taken for 1000000 examples: 287.95 s, 3472.81 examples / s
2018-02-05 16:01:22,243 - Training on epoch 1, examples #63999000-#64000000, loss: 2118.50
2018-02-05 16:01:22,247 - Time taken for 1000000 examples: 292.27 s, 3421.52 examples / s
2018-02-05 16:06:09,893 - Training on epoch 1, examples #64999000-#65000000, loss: 2117.51
2018-02-05 16:06:09,894 - Time taken for 1000000 examples: 287.64 s, 3476.51 examples / s
2018-02-05 16:11:00,706 - Training on epoch 1, examples #65999000-#66000000, loss: 2116.77
2018-02-05 16:11:00,707 - Time taken for 1000000 examples: 290.81 s, 3438.66 examples / s
2018-02-05 16:15:46,906 - Training on epoch 1, examples #66999000-#67000000, loss: 2115.44
2018-02-05 16:15:46,908 - Time taken for 1000000 examples: 286.20 s, 3494.08 examples / s
2018-02-05 16:20:32,582 - Training on epoch 1, examples #67999000-#68000000, loss: 2114.07
2018-02-05 16:20:32,584 - Time taken for 1000000 examples: 285.67 s, 3500.49 examples / s
2018-02-05 16:25:18,195 - Training on epoch 1, examples #68999000-#69000000, loss: 2113.42
2018-02-05 16:25:18,197 - Time taken for 1000000 examples: 285.61 s, 3501.27 examples / s
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-50d51e10cfa2> in <module>()
----> 1 model.train(epochs=1, batch_size=1000)

~/.virtualenvs/wiki-graph/lib/python3.5/site-packages/gensim/models/poincare.py in train(self, epochs, batch_size, print_every, check_gradients_every)
    542             self._train_batchwise(
    543                 epochs=self.burn_in, batch_size=batch_size, print_every=print_every,
--> 544                 check_gradients_every=check_gradients_every)
    545             self._burn_in_done = True
    546             logger.info("Burn-in finished")

~/.virtualenvs/wiki-graph/lib/python3.5/site-packages/gensim/models/poincare.py in _train_batchwise(self, epochs, batch_size, print_every, check_gradients_every)
    583                 batch_indices = indices[i:i + batch_size]
    584                 relations = [self.all_relations[idx] for idx in batch_indices]
--> 585                 result = self._train_on_batch(relations, check_gradients=check_gradients)
    586                 avg_loss += result.loss
    587                 if should_print:

~/.virtualenvs/wiki-graph/lib/python3.5/site-packages/gensim/models/poincare.py in _train_on_batch(self, relations, check_gradients)
    442         """
    443         all_negatives = self._sample_negatives_batch([relation[0] for relation in relations])
--> 444         batch = self._prepare_training_batch(relations, all_negatives, check_gradients)
    445         self._update_vectors_batch(batch)
    446         return batch

~/.virtualenvs/wiki-graph/lib/python3.5/site-packages/gensim/models/poincare.py in _prepare_training_batch(self, relations, all_negatives, check_gradients)
    363 
    364         vectors_u = self.kv.syn0[indices_u]
--> 365         vectors_v = self.kv.syn0[indices_v].reshape((batch_size, 1 + self.negative, self.size))
    366         vectors_v = vectors_v.swapaxes(0, 1).swapaxes(1, 2)
    367         batch = PoincareBatch(vectors_u, vectors_v, indices_u, indices_v, self.regularization_coeff)

IndexError: index 13971421 is out of bounds for axis 0 with size 13971421

At first glance, this looks like a problem with self.negative.

All files mentioned in the code

Versions

Linux-4.4.0-62-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Sep 14 2017, 22:51:06) 
[GCC 5.4.0 20160609]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 1
menshikh-iv added the "bug" (Issue described a bug) and "difficulty medium" (Medium issue: required good gensim understanding & python skills) labels on Feb 20, 2018
menshikh-iv pushed a commit that referenced this issue Mar 14, 2018
…eModel`. Fix #1917 (#1959)

* Fixes bug in negative sampling due to floating point error

* Uses counts in cumsum table instead of probabilities to avoid floating point errors

* Adds failing tests for loading old models and re-training loaded models

* Adds fix for added tests

* Fixes test docstrings

* Updates saved poincare model for tests
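
The first two commit notes above describe a floating-point problem in a cumulative-sum sampling table. What follows is a minimal sketch of how that class of bug can yield an index equal to the table length, which is the same shape as the IndexError in the trace (index 13971421 for size 13971421). It is not gensim's actual code. The function names are hypothetical, and it uses the standard-library bisect module where a NumPy implementation would likely use searchsorted.

```python
import bisect
import itertools


def build_prob_table(counts):
    """Cumulative table of probabilities (floats); the last entry may fall short of 1.0."""
    total = sum(counts)
    return list(itertools.accumulate(c / total for c in counts))


def build_count_table(counts):
    """Cumulative table of raw integer counts; the last entry is exactly sum(counts)."""
    return list(itertools.accumulate(counts))


def sample_from_prob_table(table, u):
    """Map a uniform float u in [0, 1) to an index. Hypothetical helper, not gensim API."""
    return bisect.bisect_right(table, u)


def sample_from_count_table(table, k):
    """Map a uniform integer k in [0, table[-1]) to an index. Hypothetical helper."""
    return bisect.bisect_right(table, k)


counts = [1] * 10

# Float version: ten additions of 0.1 accumulate rounding error, so the final
# cumulative value ends up slightly below 1.0.
prob_table = build_prob_table(counts)
worst_case_u = prob_table[-1]  # still a legal draw from [0, 1)
bad_index = sample_from_prob_table(prob_table, worst_case_u)
print(prob_table[-1] < 1.0, bad_index, len(counts))

# Integer version: the table is exact, so every draw below the total maps
# to a valid index.
count_table = build_count_table(counts)
indices = [sample_from_count_table(count_table, k) for k in range(count_table[-1])]
print(max(indices) < len(counts))
```

With many millions of nodes, the gap between the last cumulative probability and 1.0 is hit only rarely. That would be consistent with one epoch training cleanly and a later one failing partway through.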