tidb OOM during analyze table phase of lightning import #27187

Closed
fubinzh opened this issue Aug 13, 2021 · 5 comments
Labels: severity/major, sig/planner (SIG: Planner), type/bug (The issue is confirmed as a bug.)

Comments

fubinzh commented Aug 13, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Use Lightning to import 70 GB of test data.
  2. Wait for the Lightning import to finish, then check the TiDB status (a quick check from the TiDB side is sketched after this list).
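
For reference, one way to check from the TiDB side whether the post-import analyze actually produced statistics; a minimal sketch, assuming the table name aa.bb seen in the Lightning logs below:

-- Check statistics metadata and health for the imported table (table name from the logs below).
SHOW STATS_META WHERE Db_name = 'aa' AND Table_name = 'bb';
SHOW STATS_HEALTHY WHERE Db_name = 'aa' AND Table_name = 'bb';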

2. What did you expect to see? (Required)

Lightning import should succeed, including the analyze table phase at the end of the import.

3. What did you see instead (Required)

At the end of the Lightning import there is an analyze table task, and it fails because TiDB OOMs.

Lightning log:
[2021/08/12 16:15:02.902 +08:00] [WARN] [util.go:143] ["analyze table failed but going to try again"] [table=aa.bb] [query="ANALYZE TABLE aa.bb"] [args=null] [retryCnt=0] [error="invalid connection"]
[2021/08/12 16:15:02.902 +08:00] [WARN] [util.go:131] ["analyze table retry start"] [table=aa.bb] [query="ANALYZE TABLE aa.bb"] [args=null] [retryCnt=1]
[2021/08/12 16:15:05.904 +08:00] [WARN] [util.go:146] ["analyze table failed with no retry"] [table=aa.bb] [query="ANALYZE TABLE aa.bb"] [args=null] [retryCnt=1] [error="dial tcp 172.16.6.151:4000: connect: connection refused"]
[2021/08/12 16:15:05.904 +08:00] [ERROR] [table_restore.go:892] ["analyze failed"] [table=aa.bb] [takeTime=4m1.76799318s] [error="analyze table failed: dial tcp 172.16.6.151:4000: connect: connection refused"]
[2021/08/12 16:15:05.904 +08:00] [WARN] [table_restore.go:762] ["analyze table failed, will skip this error and go on"] [table=aa.bb] [error="analyze table failed: dial tcp 172.16.6.151:4000: connect: connection refused"]
[2021/08/12 16:15:05.904 +08:00] [INFO] [restore.go:1458] ["restore all tables data completed"] [takeTime=44m35.187444468s] []
[2021/08/12 16:15:05.904 +08:00] [INFO] [restore.go:1078] ["everything imported, stopping periodic actions"]

TiDB OOMs (dmesg on the TiDB node):
[root@centos76_vm log]# dmesg -Tw | grep -i out
...
[Thu Aug 12 16:15:01 2021] [] out_of_memory+0x4b6/0x4f0
[Thu Aug 12 16:15:01 2021] Out of memory: Kill process 14986 (tidb-server) score 972 or sacrifice child
[Thu Aug 12 16:15:01 2021] [] out_of_memory+0x4b6/0x4f0
[Thu Aug 12 16:15:01 2021] Out of memory: Kill process 14992 (tidb-server) score 972 or sacrifice child

Note: I retested this scenario; this time TiDB logged an OOM-risk warning and restarted (even though it was not OOM-killed by the kernel).
TiDB log:
[2021/08/12 19:59:00.129 +08:00] [WARN] [grpclogger.go:81] ["grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF""] [system=grpc] [grpc_log=true]
[2021/08/12 19:59:15.654 +08:00] [WARN] [memory_usage_alarm.go:139] ["tidb-server has the risk of OOM. Running SQLs and heap profile will be recorded in record path"] ["is server-memory-quota set"=false] ["system memory total"=16655896576] ["system memory usage"=14556434432] ["tidb-server memory usage"=11517374872] [memory-usage-alarm-ratio=0.8] ["record path"="/tmp/1001_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage/record"]

/var/log/messages of tidb node:
Aug 12 19:59:47 centos76_vm systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 12 19:59:47 centos76_vm systemd: Unit tidb-4000.service entered failed state.
Aug 12 19:59:47 centos76_vm systemd: tidb-4000.service failed.

Full Lightning and TiDB logs can be found at:
http://minio.pingcap.net:9000/minio/nfs/logs/tidb_oom/

4. What is your TiDB version? (Required)

TiDB version: v5.0.0-nightly-20210809
Cluster topology: each node is 8C 16G.
(cluster topology screenshot: tidb_topo)

@fubinzh fubinzh added type/bug The issue is confirmed as a bug. severity/major labels Aug 13, 2021
@fubinzh fubinzh changed the title tidb OOM during analyze table phase of lightning import of 70GB data tidb OOM during analyze table phase of lightning import Aug 13, 2021
@jingshanglu jingshanglu added the sig/planner SIG: Planner label Aug 13, 2021

chrysan commented Aug 17, 2021

/assign

fubinzh commented Aug 18, 2021

Tried the Lightning import with @@global.tidb_analyze_version = 1; the TiDB OOM issue was not observed across 4 attempts.
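
For reference, the workaround above amounts to switching the cluster back to Version 1 statistics collection before the import; a minimal sketch:

-- Fall back to Version 1 statistics collection, then verify the setting.
SET GLOBAL tidb_analyze_version = 1;
SELECT @@global.tidb_analyze_version;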

yudongusa commented:

On the BR side, Lightning's "with num" setting will be lowered to 10000, while the TiDB default stays at 100000 to preserve accuracy in the general case:

#27402
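
Assuming the "with num" setting here refers to the sample count in the analyze statement Lightning builds, the lowered value would correspond to something like the following sketch (table name reused from the logs above):

-- Hypothetical shape of the statement with the reduced sample count.
ANALYZE TABLE aa.bb WITH 10000 SAMPLES;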

winoros commented Nov 22, 2021

Closed after the 5.3 release; fixed by #28960.

github-actions commented:

Please check whether the issue should be labeled with 'affects-x.y' or 'fixes-x.y.z', and then remove the 'needs-more-info' label.
