Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uploading of the session record failed at the end of a long session using teleport-node. #10660

Closed
kmdkuk opened this issue Feb 28, 2022 · 0 comments
Labels
audit-log Issues related to Teleports Audit Log bug c-10 Internal Customer Reference c-cb Internal Customer Reference

Comments

@kmdkuk
Copy link

kmdkuk commented Feb 28, 2022

Description

What happened:

Uploading of the session record failed at the end of a long session using teleport-node.
The problem seems to be caused by the number of temporarily stored *.part files exceeding the value of RLIMIT_NOFILE (Value that can be confirmed with ulimit -n).
In our environment, RLIMIT_NOFILE was set to the default value of 1024. Also, more than 1000 *.part files were created.

What you expected to happen:

We worked around this by setting a large enough RLIMIT_NOFILE setting,
but we expect that no matter how long the session is,
it should be able to upload successfully without relying on the RLIMIT_NOFILE setting to some extent.

Reproduction Steps

As minimally and precisely as possible, describe step-by-step how to reproduce the problem.

  1. Set the RLIMIT_NOFILE setting of teleport-node small so that it is easy to verify. (e.g. ulimit -n 256 && teleport start ${SOME_OPTIONS})
  2. Log in to teleport-node from the Web UI.
  3. Continue to output large files with cat etc.
  4. Make sure there are more *.part files under the /var/lib/teleport/log/upload/streaming/default/multi/*/{SESSION_ID} directory than the value of the RLIMIT_NOFILE setting.
  5. End the session.
  6. The upload fails. (At this time, an empty directory is created in teleport-auth as in Teleport may leak directories under /var/lib/teleport/log/records/multi #9646. We couldn't reproduce it well, but when the upload was retried, an infinite number of empty directories were created and the volume inode used for teleport-auth was exhausted. )
    • Probably the problem is trying to open all *.part files in func (h *Handler) CompleteUpload.
      for _, part := range parts {
      partPath := h.partPath(upload, part.Number)
      file, err := os.Open(partPath)
      if err != nil {
      return trace.ConvertSystemError(err)
      }
      files = append(files, file)
      readers = append(readers, file)
      }

Server Details

  • Teleport version (run teleport version): Teleport v8.0.6 git: go1.17.
  • Server OS (e.g. from /etc/os-release): Ubuntu 20.04.3 LTS
  • Where are you running Teleport? (e.g. AWS, GCP, Dedicated Hardware): on-premises and GCP
  • Additional details:

Debug Logs

Please include or attach debug logs, when appropriate. Obfuscate sensitive information!

  • Start Teleport with --debug flag (teleport --debug)
  • Run tsh with --debug flag (tsh --debug)
# Log in to teleport-node from the Web UI.
cybozu@node-neco-0:~$ ps aux | grep teleport
cybozu         1  1.3  0.3 1614744 80940 ?       Ssl  05:35   0:01 teleport start --debug -c /etc/teleport/teleport.yaml --roles=node --labels=team=neco --diag-addr=0.0.0.0:3020 --insecure
cybozu        19  0.2  0.3 1466108 63052 ?       Sl   05:36   0:00 /usr/local/teleport/bin/teleport exec
cybozu        47  0.0  0.0   6308   740 pts/0    R+   05:37   0:00 grep teleport
# To make it easier to reproduce, the `RLIMIT_NOFILE` setting is intentionally smaller than it actually is.
cybozu@node-neco-0:~$ cat /proc/1/limits | grep "Max open files"
Max open files            256                  256                  files
cybozu@node-neco-0:~$ ls -al /proc/1/fd | wc -l
24
# Continue to output large files with `cat` etc.
# Make sure there are more `*.part` files under the `/var/lib/teleport/log/upload/streaming/default/multi/*/{SESSION_ID}` directory than the value of the `Max open files` setting.
cybozu@node-neco-0:~$ ls -al /var/lib/teleport/log/upload/streaming/default/multi/a03faf4a-d1d6-49ed-9871-169b75b218d6/f57c760b-c8bf-4c15-9e7a-eaf386c49d1c/
total 35140
drwx------. 2 cybozu cybozu  12288 Feb 28 05:43 .
drwx------. 3 cybozu cybozu   4096 Feb 28 05:36 ..
-rw-------. 1 cybozu cybozu 131777 Feb 28 05:40 100.part
-rw-------. 1 cybozu cybozu 136316 Feb 28 05:40 101.part
-rw-------. 1 cybozu cybozu 135318 Feb 28 05:40 102.part
-rw-------. 1 cybozu cybozu 135761 Feb 28 05:40 103.part
-rw-------. 1 cybozu cybozu 135035 Feb 28 05:40 104.part
-rw-------. 1 cybozu cybozu 136103 Feb 28 05:40 105.part
-rw-------. 1 cybozu cybozu 134710 Feb 28 05:40 106.part
~~~

cybozu@node-neco-0:~$ ls -al /var/lib/teleport/log/upload/streaming/default/multi/a03faf4a-d1d6-49ed-9871-169b75b218d6/f57c760b-c8bf-4c15-9e7a-eaf386c49d1c | wc -l
266
cybozu@node-neco-0:~$ exit
cybozu@gcp0-boot-0:~$ kubectl logs -n teleport node-neco-0
~~~
2022-02-28T05:44:28Z WARN             Failed to complete upload. error:[
ERROR REPORT:
Original Error: *trace.AccessDeniedError failed to execute command /var/lib/teleport/log/upload/streaming/default/multi/a03faf4a-d1d6-49ed-9871-169b75b218d6/f57c760b-c8bf-4c15-9e7a-eaf386c49d1c/239.part error:  too many open files
Stack Trace:
	/work/teleport/lib/events/filesessions/filestream.go:145 github.com/gravitational/teleport/lib/events/filesessions.(*Handler).CompleteUpload
	/work/teleport/lib/events/stream.go:625 github.com/gravitational/teleport/lib/events.(*sliceWriter).completeStream
	/work/teleport/lib/events/stream.go:498 github.com/gravitational/teleport/lib/events.(*sliceWriter).receiveAndUpload
	/usr/local/go/src/runtime/asm_amd64.s:1581 runtime.goexit
User Message: failed to execute command /var/lib/teleport/log/upload/streaming/default/multi/a03faf4a-d1d6-49ed-9871-169b75b218d6/f57c760b-c8bf-4c15-9e7a-eaf386c49d1c/239.part error:  too many open files] events/stream.go:628
2022-02-28T05:44:28Z WARN [SESSION:N] Failed to complete stream. error:[
ERROR REPORT:
Original Error: *trace.AccessDeniedError failed to execute command /var/lib/teleport/log/upload/streaming/default/multi/a03faf4a-d1d6-49ed-9871-169b75b218d6/f57c760b-c8bf-4c15-9e7a-eaf386c49d1c/239.part error:  too many open files
Stack Trace:
	/work/teleport/lib/events/filesessions/filestream.go:145 github.com/gravitational/teleport/lib/events/filesessions.(*Handler).CompleteUpload
	/work/teleport/lib/events/stream.go:625 github.com/gravitational/teleport/lib/events.(*sliceWriter).completeStream
	/work/teleport/lib/events/stream.go:498 github.com/gravitational/teleport/lib/events.(*sliceWriter).receiveAndUpload
	/usr/local/go/src/runtime/asm_amd64.s:1581 runtime.goexit
User Message: failed to execute command /var/lib/teleport/log/upload/streaming/default/multi/a03faf4a-d1d6-49ed-9871-169b75b218d6/f57c760b-c8bf-4c15-9e7a-eaf386c49d1c/239.part error:  too many open files] events/auditwriter.go:479
2022-02-28T05:44:28Z DEBU [SESSION:N] Session has encountered 32003 slow writes out of 89056. Check disk and network on this server. events/auditwriter.go:384
2022-02-28T05:44:28Z INFO [SESSION:N] Closing session f57c760b-c8bf-4c15-9e7a-eaf386c49d1c. srv/sess.go:640
2022-02-28T05:44:28Z DEBU [SESSION:N] Session has encountered 32003 slow writes out of 89056. Check disk and network on this server. events/auditwriter.go:384
2022-02-28T05:44:28Z DEBU [TERM:LOCA] Closed PTY srv/term.go:300
2022-02-28T05:44:28Z DEBU [SESSION:N] Stopping poll and sync of terminal size to all parties. srv/sess.go:1205
2022-02-28T05:44:28Z INFO [AUDIT]     session.end cluster_name:gcp0 code:T2004I ei:89055 enhanced_recording:false event:session.end interactive:true namespace:default participants:[cybozu] server_addr:10.64.0.223:3022 server_hostname:node-neco-0 server_id:f570aced-d7ba-43f3-a8ad-a73be1b6d68c server_labels:map[team:neco] session_recording:node session_start:2022-02-28T05:36:56.250485676Z session_stop:2022-02-28T05:44:28.255229137Z sid:f57c760b-c8bf-4c15-9e7a-eaf386c49d1c time:2022-02-28T05:44:28.255Z uid:05be224d-48b9-4711-bc82-83186dae8419 user:cybozu events/emitter.go:324
2022-02-28T05:44:32Z DEBU [UPLOAD]    Scanned 1 uploads, started 1 in /var/lib/teleport/log/upload/streaming/default. filesessions/fileasync.go:289
2022-02-28T05:44:32Z DEBU [UPLOAD]    Session upload completed. duration:60.918939ms session-id:f57c760b-c8bf-4c15-9e7a-eaf386c49d1c filesessions/fileasync.go:443

gz#4249

@kmdkuk kmdkuk added the bug label Feb 28, 2022
@programmerq programmerq added the c-cb Internal Customer Reference label Mar 2, 2022
@zmb3 zmb3 added the audit-log Issues related to Teleports Audit Log label Mar 2, 2022
zmb3 added a commit that referenced this issue Mar 3, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 3, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 3, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 3, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 3, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 3, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 4, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 4, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 4, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 4, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 4, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 8, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
zmb3 added a commit that referenced this issue Mar 8, 2022
When completing a file-based upload, open the parts files one at a time
and write them to the upload, closing each file before opening the next
one.

This is preferrable to opening them all at once and closing all files at
the end, because it consumes less file descriptors.

Updates #10660
@ArunNadda ArunNadda added the c-10 Internal Customer Reference label Apr 27, 2022
@zmb3 zmb3 closed this as completed Mar 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audit-log Issues related to Teleports Audit Log bug c-10 Internal Customer Reference c-cb Internal Customer Reference
Projects
None yet
Development

No branches or pull requests

4 participants