Skip to content
This repository has been archived by the owner on Apr 16, 2019. It is now read-only.

Trezor MB Java API - USB attach takes long time on Windows #98

Closed
martin-lizner opened this issue Jul 3, 2016 · 11 comments
Closed

Trezor MB Java API - USB attach takes long time on Windows #98

martin-lizner opened this issue Jul 3, 2016 · 11 comments

Comments

@martin-lizner
Copy link

First choice where to report the bug should be Multibit hardware project, which Im referring to as Java API. I already did that - link bellow. But I have a strong feeling (not 100% however) that problem is in Trezor firmware.

The problem is that after providing the PIN to Trezor, doing software detach in HID Java API and starting the API again, USB init takes about 30 seconds to complete. The only cure is to do hard USB detach (unplug cable :-) So why do I feel like its a Trezor problem?

  1. This happens only after PIN was provided. In other words, doing soft detach & attach in MB Java API before providing PIN is < 1 sec.
  2. Problem does not happen with KeepKey device which is using the very same MB Java API.

Repeat scenario is described here:
Multibit-Legacy/multibit-hardware#29

FW: 1.3.5, 1.3.6

Thanks, M.

@jhoenicke
Copy link
Contributor

I looked into this, but this is not so easy to fix.

Background: While libopencm3 handles low-level usb messages in interrupts, one has to regularly call usbd_poll from the main thread to even answer low level requests. Trezor firmware doesn't do this when it is busy, e.g. computing the bip32 master key from the bip39 seed.

In this case the OS usb stack or some other part of Java asks Trezor for a STRING REPORT (a very low level message that is usually handled by libopencm3). I'm not sure what triggers it. An enumerate for example would trigger this. When this request isn't answered within a few milli seconds, the libopencm3 usb stack gives up. This causes chaos, especially when later the STRING REPORT is handled and there is an unmatched packet in the queue.

The workaround on the PC side is to not do anything like an enumerate, directly after a package was sent to the Trezor for at least three seconds. python-trezor uses a timeout of 10 seconds before reenumerating to check if device is still there.

Maybe we could call usbd_poll in long running operations like pbkdf2, but I'm not sure how often one must call it to avoid this problem. If it must be called more often than once every 10 ms, then the same problem would be with public key derivation and signing. Also simply calling usbd_poll is dangerous as it also handles high-level messages and lead to unexpected recursions.

@martin-lizner
Copy link
Author

Thanks for the reply! I must say I dont understand much of what you wrote, but Im not very good at low level stuff :-) Just my limited view - could it be much simplier? Again, KeepKey does not suffer from this problem.

Fixing this would not only help my app Trezor SSH Agent but also improve experience of Trezor users on Multibit.

M.

@jhoenicke
Copy link
Contributor

I'm not so familiar with the Java code that I know who reenumerates the usb. But someone does this every 0.5 s. If you can somehow stop this task, that would help. If not, at least make the polling much slower, but then it can still happen if it polls at the wrong moment.

Keepkey has the same problem, but only if additional passphrase is enabled. If passphrase isn't enabled keepkey never performs the time consuming bip39 computation (I think it caches the bip32-root node in that case).

@martin-lizner
Copy link
Author

I submitted new issue to hid4java project. The problem is opened in three different projects now, lets hope someone will fix it. I know I cant.

gary-rowe/hid4java#42

@jhoenicke
Copy link
Contributor

There are three things contributing to this problem; fixing one of them would be enough:

  1. java implementation enumerates the hid device every 0.5 seconds.
  2. Trezor doesn't call usbd_poll often enough to handle the request in time.
  3. the usb stack on Trezor (libopencm3) hangs up after a setup request times out and doesn't recover from this.

I'm hoping to get 3. fixed in the next firmware. I'm still trying to understand why this hang-up happens and how to detect that it happened. I already know how to fix it, when it happened....

@prusnak
Copy link
Member

prusnak commented Jul 5, 2016

Not sure if I understand what 0.5 second constant means.

In our stack we enumerate 500ms AFTER the previous enumeration was finished.

If that's not the case (i.e. mbhw enumerates every 500ms without waiting for the result of previous operation), I suggest to change the logic.

@jhoenicke
Copy link
Contributor

I didn't find this code that enumerates... Yes this completely explains it.

The Trezor firmware has a bug that if an enumerate happens during the "Waking up" phase, the device will go to this state where it cannot answer control requests and gets very slow. If you enumerate 0.5 seconds after the last enumeration finished, then it is guaranteed to happen.

Is it easy to change the Java implementation? At least don't enumerate the first five seconds after the last message was sent to the Trezor? This would be a workaround for current firmware.

@jhoenicke
Copy link
Contributor

Of course, the bug is really on the Trezor side. It should be able to handle an enumerate at any time. One problem is that the usb stack on Trezor side implemented in libopencm3 gets confused. I opened libopencm3/libopencm3#668 and have a preliminary fix for this. But it still needs some more thought.

jhoenicke added a commit to jhoenicke/trezor-mcu that referenced this issue Jul 7, 2016
This patch adds calls to usbPoll in the progress callback.  This
should address trezor#98.

We call usbDelay instead of Poll, to call usbd_poll several times.
Otherwise it would only handle one event instead of handling all
events that were pending so far.  The ugly magic number 5 is a guess.

Note that we also need to set usbTiny, so that we don't recursively
process messages.  Since we don't know whether usbTiny is set, we
need to store the old value (especially true for u2f).
@jhoenicke
Copy link
Contributor

Note that the above fix is not enough; one also needs to fix another bug in libopencm3. Also it would probably better to shorten the update interval for bip39.

jhoenicke added a commit to jhoenicke/trezor-mcu that referenced this issue Jul 10, 2016
This patch adds calls to usbPoll in the progress callback.  This
should address trezor#98.

We call usbDelay instead of Poll, to call usbd_poll several times.
Otherwise it would only handle one event instead of handling all
events that were pending so far.  The ugly magic number 5 is a guess.

Note that we also need to set usbTiny, so that we don't recursively
process messages.  Since we don't know whether usbTiny is set, we
need to store the old value (especially true for u2f).

This fix also relies on another fix in libopencm3.
@jhoenicke
Copy link
Contributor

jhoenicke commented Jul 12, 2016

I have opened the pull request libopencm3/libopencm3#672. Can you try if it solves your problems? You can try the below patch to build the firmware with docker.

diff --git a/firmware-docker-build.sh b/firmware-docker-build.sh
index 2b32c6c..bbc8ff4 100755
--- a/firmware-docker-build.sh
+++ b/firmware-docker-build.sh
@@ -10,6 +10,11 @@ docker run -t -v $(pwd)/output:/output $IMAGETAG /bin/sh -c "\
        cd trezor-mcu && \
        git checkout $FIRMWARETAG && \
        git submodule update --init && \
+       cd vendor/libopencm3 && \
+       git remote add jhoenicke https://github.com/jhoenicke/libopencm3 && \
+       git fetch jhoenicke && \
+       git checkout jhoenicke/fix1 && \
+       cd ../.. && \
        make -C vendor/libopencm3 && \
        make && \
        make -C firmware && \

I also have a branch in trezor-mcu called pollfix that calls usbd_poll more often; it still needs the above pull request to work.

prusnak pushed a commit that referenced this issue Nov 9, 2016
This patch adds calls to usbPoll in the progress callback.  This
should address #98.

We call usbDelay instead of Poll, to call usbd_poll several times.
Otherwise it would only handle one event instead of handling all
events that were pending so far.  The ugly magic number 5 is a guess.

Note that we also need to set usbTiny, so that we don't recursively
process messages.  Since we don't know whether usbTiny is set, we
need to store the old value (especially true for u2f).

This fix also relies on another fix in libopencm3.
@prusnak
Copy link
Member

prusnak commented Nov 9, 2016

BIP39 Interval was shortened in trezor/trezor-crypto@9101c05

Pulled the change from pollfix branch in 9287dd7

Last thing missing is upstream merge of libopencm3/libopencm3#672

Let's wait a few days and use local copy of libopencm3 if this does not happen before the next release.

Closing this for now.

@prusnak prusnak closed this as completed Nov 9, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants