
About reading Fox Go #752

Closed

hope366 opened this issue Sep 18, 2020 · 39 comments



hope366 commented Sep 18, 2020

Fox Go game records that contain AI analysis and commentary cannot be read by Lizzie. Could an improvement make this possible?


kaorahi commented Sep 19, 2020

  1. Could you attach the sgf? (may need to zip it)
  2. What do you mean by "cannot be read"? Do you get an error or a wrong result? (What error? How wrong?)

ref. sanderland/katrain#177 (comment)


hope366 commented Sep 19, 2020

Thank you for paying attention to my post.
Here is an example of a file that Lizzie cannot read:
https://drive.google.com/file/d/1c4IXaS3EoyMqsVPt0yfjxS9gE86GrE8l/view?usp=sharing
The expression "cannot be read" may have been misleading. To be concrete: I open the file with the "o" key.
When I try to open the above file, Lizzie freezes without any error message and accepts no further input; the only way out is to force-close it from the task manager.

By the way, KaTrain v1.5.0 can open the above file. However, once analysis is on, pressing the spacebar does not turn it off. I'll post that issue on KaTrain's issue tracker.


kaorahi commented Sep 20, 2020

The above SGF opens successfully on my PC (Linux), so unfortunately it is difficult for me to fix this issue...


hope366 commented Sep 20, 2020

I did some research into what differs between the cases where the file can be read and where it cannot.

The important point seems to be which engine is set when the SGF file is loaded.

・ Katago-v1.4.2-cuda10.1-windows-x64
・ Katago-v1.4.2-opencl-windows-x64
・ Katago-v1.5.0-opencl-windows-x64
・ Katago-v1.6.1-gpu-opencl-windows-x64
・ Leela-zero-0.17-win64

With any of these set, the SGF file with evaluation values and commentary loads successfully.

・ Katago-v1.5.0-cuda10.2-windows-x64
・ Katago-v1.6.1-gpu-cuda10.2-windows-x64

With either of these set, opening an SGF file with evaluation values and commentary freezes Lizzie.

I suspect "cuda 10.2" is the cause, but I have no expertise here, so I cannot say for sure.

Once the SGF file with evaluation values and commentary has been opened successfully, switching to either of the two engines above in the engine settings still allows normal analysis.

kaorahi pushed a commit to kaorahi/lizzie that referenced this issue Sep 20, 2020

kaorahi commented Sep 20, 2020

From your hints, I noticed that more than 5000 GTP commands were sent needlessly to the engine when the above SGF was loaded. I suspect they may be stressful for the engine.

So I made a quick & dirty patch (7f782ac) to stop them. Would you try it?

https://github.com/kaorahi/lizzie/tree/fix752_fox

(I've not tested it well and I'm not sure whether it works correctly.)

@lightvector

This thread raises red flags for me.

If that fixes it, I'd caution strongly against just concluding that this is the problem and then thinking everything is fine after no longer sending the commands. If KataGo or Leela Zero receive 5000 GTP commands, they will still parse them correctly and handle them correctly, one by one. If Lizzie were to then handle the responses correctly, one by one, then there would be no problem. Computer programs send each other thousands of lines of data all the time with no problem.

So if no longer sending 5000 useless commands fixes it, then probably there is still an underlying bug - a race condition in Lizzie's handling of commands and responses that is triggered by a large volume of GTP commands and responses that get buffered up in various ways. Presumably Lizzie has a race condition that causes it to lose track of the message and response pairing, so maybe it ends up miscounting and waiting forever on a response, never realizing it was already sent, or something like that.

If this is the case, then even though no longer sending these commands is desirable, before eliminating them one would also want to find and fix the true underlying bug that is causing Lizzie's problems. Simply stopping the 5000 commands won't solve this - if the underlying race condition is still there, it will emerge in other situations besides this one, so now is the time to debug it, while it is reproducible.
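
For illustration, here is a hypothetical sketch (not Lizzie's actual code) of the kind of pairing bug described above: if the count of outstanding commands is not updated atomically, a lost update can leave the counter stuck above zero and the waiter waits forever for a response that already arrived.

  // Hypothetical illustration, not Lizzie's code. pending is shared between
  // two threads but not synchronized, so an increment and a decrement can
  // race and one update is lost: pending never returns to 0.
  class CommandCounter {
    private int pending = 0;                  // commands sent but not yet answered

    void onCommandSent()      { pending++; }  // called by the UI thread
    void onResponseReceived() { pending--; }  // called by the reader thread

    void waitUntilIdle() throws InterruptedException {
      while (pending > 0) {   // after a lost update this loop never exits,
        Thread.sleep(10);     // even though every response was in fact received
      }
    }
  }

The fix for this kind of bug is to guard the counter with a shared lock (and use wait/notify instead of polling).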


hope366 commented Sep 20, 2020

I tried 7f782ac.
It seems to have loaded successfully without freezing 😄
Thank you for your wonderful work.
Thanks also to lightvector for the comment. It's a little difficult for me, but it seems interesting.


kaorahi commented Sep 20, 2020

I'm not using Lizzie myself these days and don't have enough motivation for tough debugging... But I'd like to know at least what is happening. (Is it really freezing, or just too slow?)

So, @hope366, would you try...

  1. Revert my patch and start the original Lizzie 0.7.3.
  2. Open GTP console. (View > Panel > GtpConsole)
  3. Load the SGF.
  4. Observe GTP console.

What do you see on GTP console? If you repeat the experiments, do you get the same result every time?


hope366 commented Sep 20, 2020

If I load the SGF without the GTP console displayed, Lizzie freezes and accepts nothing. But if I load the SGF with the GTP console displayed in advance, the following appears in the GTP console and the SGF itself loads successfully.

=1755 
=1755 
2020-09-21 01:45:27+0900: =1755 
2020-09-21 01:45:27+0900: Controller: 1756 undo
=1756 
=1756 
2020-09-21 01:45:27+0900: =1756 
2020-09-21 01:45:27+0900: Controller: 1757 undo
=1757 
=1757 
2020-09-21 01:45:27+0900: =1757 
katanetwork.gz> 1759 play W C6
2020-09-21 01:45:27+0900: Controller: 1758 play B P4
=1758 
=1758 
2020-09-21 01:45:27+0900: =1758 
2020-09-21 01:45:27+0900: Controller: 1759 play W C6
=1759 
1759 
2020-09-21 01:45:27+0900: =1759 
katanetwork.gz> 1760 play B R13
katanetwork.gz> 1761 play W O17
2020-09-21 01:45:27+0900: Controller: 1760 play B R13
katanetwork.gz> 1762 play B M17
=1760 
=1760 
2020-09-21 01:45:27+0900: =1760 
2020-09-21 01:45:27+0900: Controller: 1761 play W O17
=1761 
=1761 
2020-09-21 01:45:27+0900: =1761 
2020-09-21 01:45:27+0900: Controller: 1762 play B M17
=1762 
=1762 
2020-09-21 01:45:27+0900: =1762 
katanetwork.gz> 1763 play W N3
katanetwork.gz> 1764 play B L3
2020-09-21 01:45:27+0900: Controller: 1763 play W N3
katanetwork.gz> 1765 undo
katanetwork.gz> 1766 undo
katanetwork.gz> 1767 undo
=1763 
=1763 
2020-09-21 01:45:27+0900: =1763 
2020-09-21 01:45:27+0900: Controller: 1764 play B L3
=1764 
=1764 
2020-09-21 01:45:27+0900: =1764 
katanetwork.gz> 1768 undo
2020-09-21 01:45:27+0900: Controller: 1765 undo
=1765 
=1765 
2020-09-21 01:45:27+0900: =1765 
2020-09-21 01:45:27+0900: Controller: 1766 undo
=1766 
=1766 
2020-09-21 01:45:27+0900: =1766 
2020-09-21 01:45:27+0900: Controller: 1767 undo
=1767 
=1767 
2020-09-21 01:45:27+0900: =1767 
katanetwork.gz> 1769 undo
katanetwork.gz> 1770 undo
katanetwork.gz> 1771 undo
katanetwork.gz> 1772 play B C5
2020-09-21 01:45:27+0900: Controller: 1768 undo
=1768 
=1768 
2020-09-21 01:45:27+0900: =1768 
katanetwork.gz> 1773 play W C4
katanetwork.gz> 1774 play B D5
katanetwork.gz> 1775 play W F3
katanetwork.gz> 1776 play B D9
2020-09-21 01:45:27+0900: Controller: 1769 undo
=1769 
=1769 
2020-09-21 01:45:27+0900: =1769 
2020-09-21 01:45:27+0900: Controller: 1770 undo
=1770 
=1770 
2020-09-21 01:45:27+0900: =1770 
2020-09-21 01:45:27+0900: Controller: 1771 undo
=1771 
=1771 
2020-09-21 01:45:27+0900: =1771 
2020-09-21 01:45:27+0900: Controller: 1772 play B C5
=1772 
=1772 
2020-09-21 01:45:27+0900: =1772 
katanetwork.gz> 1777 play W P3
katanetwork.gz> 1778 play B E4
katanetwork.gz> 1779 play W E3
katanetwork.gz> 1780 undo
katanetwork.gz> 1781 undo
katanetwork.gz> 1782 undo
katanetwork.gz> 1783 undo
katanetwork.gz> 1784 undo
katanetwork.gz> 1785 undo
katanetwork.gz> 1786 undo
katanetwork.gz> 1787 undo
2020-09-21 01:45:27+0900: Controller: 1773 play W C4


hope366 commented Sep 20, 2020

The above is the result when SGF is read with KataGo-v1.6.1-CUDA set.
The screenshot below is the result with KataGo-v1.4.2-opencl, which is included in the official release of lizzie-v0.7.3.
[screenshot: GTP console output with KataGo-v1.4.2-opencl]

kaorahi pushed a commit to kaorahi/lizzie that referenced this issue Sep 20, 2020

kaorahi commented Sep 20, 2020

Strange...

  1. Please observe and paste the BOTTOM lines of GTP console.
  2. In "KataGo-v1.6.1-CUDA + GTP console", the command IDs (e.g. 1787) seem too small. They should be more than 5000. Is the SGF really loaded successfully to the end?
  3. Could you try another version?

The suspicious "command queue" is removed in bbb910a.

https://github.com/kaorahi/lizzie/tree/noQ_for752


hope366 commented Sep 21, 2020

I'm sorry, I may have misunderstood and loaded a different SGF.
I tried again with KataGo-v1.4.2-OpenCL, KataGo-v1.5.0-CUDA, and KataGo-v1.6.1-CUDA. Here is the last part displayed in the GTP console.

KataGo-v1.4.2-OpenCL

[screenshot: end of GTP console output with KataGo-v1.4.2-OpenCL]

KataGo-v1.5.0-CUDA

[screenshot: end of GTP console output with KataGo-v1.5.0-CUDA]

KataGo-v1.6.1-CUDA

[screenshot: end of GTP console output with KataGo-v1.6.1-CUDA]


hope366 commented Sep 21, 2020

7f782ac
Here is the GTP console display when the SGF is loaded with Lizzie with the above patch applied. Engine = KataGo-v1.6.1-CUDA

[screenshot: GTP console output with the patch (7f782ac) applied]

kaorahi pushed a commit to kaorahi/lizzie that referenced this issue Sep 21, 2020

kaorahi commented Sep 21, 2020

thx. If you are willing to help with further debugging:

  1. Do you fail to open the sgf EVERY TIME you try cuda 10.2 without GTP console, and succeed EVERY TIME with GTP console?
  2. If possible, would you try https://github.com/kaorahi/lizzie/tree/noQ_for752 (bbb910a) without the GTP console? Can you open the SGF successfully? This version may be safer against race conditions. (But you may suffer annoying lags instead when you press the Home/End keys.)
  3. Could you also try https://github.com/kaorahi/lizzie/tree/for752_debug1 (a30bce0) without GTP console? This version should NOT solve the issue. So please paste a screenshot of the title bar when Lizzie freezes. Debug prints are shown on the title bar in this version.

If 1 is "no", you may need to try 2 and 3 several times for confirmation.

I'm trying a wrapper around the engine that adds artificial random lags. But I still cannot reproduce the issue on my machine (Linux, no external GPU).
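
For reference, such a wrapper can be sketched like this (hypothetical code, not what I actually use): it launches the real engine and relays both pipes, inserting random delays, to try to shake out timing-dependent bugs.

  // Hypothetical sketch of a "laggy" engine wrapper. Run the GUI against
  //   java LaggyEngineWrapper katago gtp -model ... -config ...
  // instead of the engine itself.
  import java.io.InputStream;
  import java.io.OutputStream;
  import java.util.Random;

  public class LaggyEngineWrapper {
    public static void main(String[] args) throws Exception {
      Process engine = new ProcessBuilder(args).start();
      pump(System.in, engine.getOutputStream());  // GUI -> engine
      pump(engine.getInputStream(), System.out);  // engine -> GUI
      engine.waitFor();
    }

    // Copy one stream to the other on a background thread, with random lags.
    static void pump(InputStream in, OutputStream out) {
      Thread t = new Thread(() -> {
        Random rng = new Random();
        byte[] buf = new byte[4096];
        try {
          int n;
          while ((n = in.read(buf)) != -1) {
            Thread.sleep(rng.nextInt(20));  // artificial random lag
            out.write(buf, 0, n);
            out.flush();
          }
        } catch (Exception e) {
          // the engine or the GUI closed the pipe; let the thread end
        }
      });
      t.setDaemon(true);
      t.start();
    }
  }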


hope366 commented Sep 21, 2020

  1. Do you fail to open the sgf EVERY TIME you try cuda 10.2 without GTP console, and succeed EVERY TIME with GTP console?

Yes, it always succeeds with the GTP console and always freezes without the GTP console.

bbb910a
I tried this, but it seems to freeze no matter how many times I try it.

a30bce0
Immediately after loading the SGF, Lizzie freezes; at the same time, various information flashes rapidly in the title bar, and it finally ends up as in the screenshot below.

[screenshot: title bar when Lizzie freezes (a30bce0)]

kaorahi pushed a commit to kaorahi/lizzie that referenced this issue Sep 21, 2020

kaorahi commented Sep 21, 2020

thx. Then never mind bbb910a. As a starting point, it's good news that you can reproduce the issue stably and that I can get information from the title bar in a30bce0.

Next, I'd like to find where the program hangs. Would you try https://github.com/kaorahi/lizzie/tree/for752_debug2 (c1304ed) in the same way?


hope366 commented Sep 21, 2020

I tried c1304ed.
The screenshot below shows what the title bar looks like when Lizzie freezes.

[screenshot: title bar when Lizzie freezes (c1304ed)]

kaorahi pushed a commit to kaorahi/lizzie that referenced this issue Sep 21, 2020

kaorahi commented Sep 21, 2020

nice! You captured the functions where the program stopped. The UI thread was in sendCommandToLeelaz(). The reader thread, which receives output from the engine, was in parseLine().

Now let's pin down the exact points within those functions with https://github.com/kaorahi/lizzie/tree/for752_debug3 (f04bb62).


hope366 commented Sep 21, 2020

I tried f04bb62. See the screenshot below.

[screenshot: title bar when Lizzie freezes (f04bb62)]


kaorahi commented Sep 21, 2020

Thanks to your screenshots, we have pinpointed the problem.

The result is unexpected to me. @lightvector, would you give me any advice?

In Leelaz.java, the reader thread seems to be stopped between these two debugPrint calls. This part is understandable: it is blocked on synchronized (this), as expected.

  private void parseLine(String line) {
    debugPrint("@parseLine:" + line);
    synchronized (this) {
      debugPrint("parseLine1:" + line);

But why did the UI thread hang between these two debugPrint calls? Does this suggest some overflow of a stream rather than a race condition between two threads?

        debugPrint("sendCommandToLeelaz5:" + command);
        outputStream.write((command + "\n").getBytes());
        outputStream.flush();
        lastSentCommand = command;
        debugPrint("sendCommandToLeelaz6:" + command);

@lightvector

Yeah, it sounds like it might be some sort of thread deadlock or buffer issue rather than a race (although maybe it's a race that causes a later deadlock). I haven't dived into Lizzie's code in a while, but I can certainly take a look in the next day or two.


kaorahi commented Sep 21, 2020

thx. I guess...

  • The UI thread is stopped because the pipe from Lizzie to KataGo is full. This is not resolved because KataGo is stopped.
  • KataGo is stopped because the pipe from KataGo to Lizzie is full. This is not resolved because the reader thread is stopped.
  • The reader thread is stopped because the UI thread is in the mutual-exclusion region. This is not resolved because the UI thread is stopped.
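
To make the cycle concrete, here is a minimal sketch (illustrative names, not Lizzie's actual code) of how the three blockages lock each other, assuming a hypothetical engine process that answers every command it reads:

  class DeadlockSketch {
    private final Object lock = new Object();

    // UI thread: holds the lock while writing thousands of commands.
    void uiThread(java.io.OutputStream toEngine) throws java.io.IOException {
      synchronized (lock) {
        for (int i = 0; i < 100_000; i++) {
          toEngine.write("play B D4\n".getBytes()); // blocks here once the
          toEngine.flush();                         // engine's stdin pipe is full
        }
      }
    }

    // Reader thread: needs the same lock to parse each response.
    void readerThread(java.io.BufferedReader fromEngine) throws java.io.IOException {
      String line;
      while ((line = fromEngine.readLine()) != null) {
        synchronized (lock) {  // blocked forever: the UI thread never returns
          /* parse line */     // from write(), so it never releases the lock;
        }                      // meanwhile the engine's stdout pipe fills up
      }                        // and the engine stops reading its stdin
    }
  }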


lightvector commented Sep 21, 2020

I have not dug into the code, but based on your description it sounds like the UI thread is the one that is implemented incorrectly. It is usually not a good idea to hold a lock while you could be blocked indefinitely on something else.

Conceptually, if we zoom out and think about what we would like to happen in this situation: we would like the UI thread to just wait, without blocking anything else. The reader thread should continue, the KataGo response processing should continue, and at the end the UI thread can be updated based on the results once all the commands are processed. So it is a bug if the UI thread can block the reader thread for more than very brief periods of time.

Can we make the UI thread not block the reader thread while it waits? The right way to do this might depend on the code structure - whether it is to split the fields of the object and use separate locks for reading-related fields vs writing-related fields, or to make the UI thread loop on a condition variable so it can drop the lock and sleep while it's not done, or something else similar.


kaorahi commented Sep 22, 2020

If the above trilemma guess is correct, the fix would be to move these two lines

        outputStream.write((command + "\n").getBytes());
        outputStream.flush();

in sendCommandToLeelaz() to somewhere outside synchronized (this) {...}. Note that sendCommandToLeelaz() is called indirectly from both the UI thread and the reader thread, and both callers are inside synchronized (this) {...}.

I am considering two plans.

Plan A:
Create a separate "writer thread". The writer thread sends the commands in the command queue to KataGo and then sleeps until it is triggered again. The UI thread and the reader thread never call outputStream.write() themselves; they only manipulate the command queue and wake the writer thread. (A sketch follows below.)

Plan B:
Modify every synchronized (this) {...} so that sendCommandToLeelaz() is never called inside these {...}.

If it is feasible, Plan A is cleaner and needs smaller changes to the existing code. Plan B requires careful rewrites of various parts, which could easily be broken by casual modifications in future development.

At present, Plan A is stalled because I know almost nothing about Java, and Plan B is stalled because it needs many rewrites together with careful tracing of the existing code.
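
For concreteness, here is a rough sketch of what Plan A might look like with java.util.concurrent (hypothetical names, not an actual patch): callers only enqueue, and a dedicated writer thread is the sole owner of outputStream, so no caller can block on a full pipe while holding a lock.

  import java.io.OutputStream;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  class EngineWriter {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    EngineWriter(OutputStream toEngine) {
      Thread writer = new Thread(() -> {
        try {
          while (true) {
            String command = queue.take();  // sleeps until a command arrives
            toEngine.write((command + "\n").getBytes());
            toEngine.flush();               // only this thread may block here
          }
        } catch (Exception e) {
          // engine died or we were interrupted; let the thread end
        }
      });
      writer.setDaemon(true);
      writer.start();
    }

    // Safe to call from the UI thread or the reader thread, even inside
    // synchronized (this) blocks: offer() never blocks on an unbounded queue.
    void sendCommand(String command) {
      queue.offer(command);
    }
  }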

Unfortunately, there are several functions in Leelaz.java that are written like

  public void playMove(Stone color, String move) {
    synchronized (this) {
      ...
      sendCommand("play " + colorString + " " + move);
      ...
    }
  }

and I am not sure what these synchronized blocks are actually intended to protect. These playMove() etc. are called even when we build a game history only internally (e.g. in parseSgf()).

My current opinion is that I would prefer my quick & dirty patch over Plan B, because Plan B seems too fragile against future changes. The code in Lizzie is not strictly controlled; various developers (including me) have written various things as they like. As for Plan A, I do not have enough Java common sense to decide whether it is feasible.

May I hear your comments?

Personally, I am in "maintenance phase" for Lizzie (though I am not an official member of this project). I am willing to fix issues as long as they can be solved by small changes; I have no motivation for large rewrites.


hope366 commented Sep 22, 2020

Ideally a major modification is desirable, and I would like to see it realized, but I am satisfied as is: with the quick & dirty patch the SGF loads successfully, which is very convenient.
Thank you for taking the time and effort to respond to my trivial request.


kaorahi commented Sep 22, 2020

@hope366, thanks for many experiments. Your reports are essential for me to understand the problem.

This note seems helpful for implementing Plan A, though I've not tried it yet.
https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/lang/doc-files/threadPrimitiveDeprecation.html


hope366 commented Sep 22, 2020

I don't know Java myself, so I'd be very grateful if you could just show me a fix.
If it's a simple experiment like this one, I'm happy to help, so please don't hesitate to ask.
https://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/lang/doc-files/threadPrimitiveDeprecation.html
This is too difficult for me to understand, but I hope this document helps bring Plan A to fruition.

kaorahi pushed a commit to kaorahi/lizzie that referenced this issue Sep 23, 2020

kaorahi commented Sep 23, 2020

I'm trying to implement Plan A. @hope366, could you try https://github.com/kaorahi/lizzie/tree/for752_planA1 (7d3a4d3)?

(I would welcome a code review by Java programmers. I know nothing about Java.)


hope366 commented Sep 23, 2020

I tried 7d3a4d3. I tested several Fox Go game records that include commentary and many branches (the kind that might make Lizzie freeze), and they all loaded without problems.

@lightvector

@kaorahi - Nice work! If a simple change like this, moving the code to a writer thread, fixes the deadlock cycle, then this is excellent, and I believe your overall approach looks good.

As a review, I can recommend some details to make sure there will be no problems from this kind of change. I do not have as much familiarity with the Lizzie code as you do, so I do not have a good understanding of the dependencies and what different things synchronize on or block what other things. However, this looks like a very locally-understandable and simple change so I think I can make a recommendation.

Currently there is added synchronization on two different things, but the pattern for code like this (in any language, not just Java) is generally:

  • Writer thread: in a single synchronized block: while (condition is not true) wait.
  • Notifier thread: in a single synchronized block: do the thing that makes the condition true, then notify.

This way of doing the pattern removes any risk of races or spurious-wakeup problems. If the condition is already true, the writer thread never sleeps in the first place. If the condition is false, the writer waits, and since the check and the wait happen in a single synchronized block, there is no window in which the writer thread sees the condition as false and is about to wait, the notifier slips in, makes the condition true, and sends the notify before the writer has actually started waiting - which would make the writer miss the notify forever.
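
For concreteness, here is the pattern written out as a minimal sketch (hypothetical class, not Lizzie's code): both the wait loop and the notify use ONE lock, so "check condition, then wait" is atomic and no wakeup can be lost.

  class CommandQueue {
    private final java.util.ArrayDeque<String> queue = new java.util.ArrayDeque<>();

    // Writer thread: wait, in a single synchronized block, until work exists.
    String takeCommand() throws InterruptedException {
      synchronized (this) {
        while (queue.isEmpty()) {  // loop guards against spurious wakeups
          wait();                  // releases the lock while sleeping
        }
        return queue.poll();
      }
    }

    // Notifier thread: make the condition true, then notify, under the same lock.
    void addCommand(String command) {
      synchronized (this) {
        queue.add(command);
        notifyAll();
      }
    }
  }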

Based on this, I will shortly leave some comments on your commit that I hope will be helpful. Let me know what you think!


lightvector commented Sep 24, 2020

Okay, I finished my review and comments on the commit: 7d3a4d3.

I'm not an expert in Lizzie's internals, and at the moment I'm not in a position to easily build and test Lizzie. So I'm mostly basing my comments on a general understanding of how to write good multithreaded code, plus reading your commit along with Lizzie's source. It is possible you will find a bug or misunderstanding, or that the code I suggested doesn't work for some reason. Hopefully, however, the comments also make clear the principles and understanding behind these suggestions.

If you try the suggestions and they don't work, and you think the code is too complicated, then okay, sure, just do the quick and dirty patch. :)

That would be the approach if you are giving up on the code and saying it is too messy and unfamiliar to maintain, and ultimately needs to be rewritten. In that case, it would indeed be legitimate to leave some bugs in there, just try to mitigate them, and wait for featurecat or someone else to eventually come along with something new and shiny to replace it entirely. :)


kaorahi commented Sep 24, 2020

Thanks a lot for your kind review! As a programming lover, I enjoy learning a new tesuji from this chance :)

Here is the updated version: https://github.com/kaorahi/lizzie/tree/for752_planA3 (0a047eb). @hope366, would you test this?


hope366 commented Sep 24, 2020

Starting from the state with planA1 (7d3a4d3) applied, I further applied planA3 (0a047eb). Is that okay?
As with planA1, the SGF loaded without any problems 😄

By the way, I posted two new issues and a pull request about GONGJE on Lizzie's issue tracker.
I would appreciate it if you could read them.


kaorahi commented Sep 25, 2020

thx. Then I'll make two pull requests: the revised Plan A as the fix for this issue, and the quick & dirty patch to accelerate the loading of huge SGFs.

Yes, I've noticed your posts. But, unlike 62032e4, e0a49d4, etc., your new issues are not simple mistakes; they are requests for improvements that actually require additional coding. I'll send patches if I can find easy ways to implement them.


hope366 commented Sep 25, 2020

Thanks for making lizzie even more useful and easy to use 😄
Deepest thanks to kaorahi and lightvector for their excellent work.

I'm currently applying these two separately; is it possible to apply and use them at the same time?


kaorahi commented Sep 26, 2020

Yes. You can apply both of them at the same time.


hope366 commented Sep 26, 2020

thx a lot.
Thank you for creating the pull requests.
I'll verify them one by one from now on 😄


hope366 commented Sep 26, 2020

I verified each of the created pull requests.
For those unfamiliar with computers, these improvements will be a great help.
I have a question.

Only this one item I could not confirm.
Under what conditions can it be confirmed?
Where does "Engine is down." appear?


kaorahi commented Sep 26, 2020

thx for the check. The message is shown as in the screenshot in #747 (comment), together with the engine's error message in the GTP console. We should've implemented this earlier.

hope366 closed this as completed Oct 1, 2020