
[Bug] [zeta] java.lang.OutOfMemoryError: Metaspace #4915

Closed · 3 tasks done
chaorongzhi opened this issue Jun 12, 2023 · 60 comments · Fixed by #6114, #6355, #6477 or #6492

@chaorongzhi (Contributor)

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

I run about 15 batch synchronization tasks per minute. With MaxMetaspaceSize=2g set in jvm_options, an OutOfMemoryError appears after about 1.5 hours of running.

SeaTunnel Version

2.3.1

SeaTunnel Config

env {
    job.mode = BATCH
    execution.parallelism = 1
    checkpoint.interval = 86400000
}
source {
    Clickhouse {
        database = xxx
        password = "xxx"
        host = "xxx:xxx"
        table = xxx
        username = xxx
        sql = "select * from xxx where timeCreated > 1684891467083"
    }
}
sink {
    Clickhouse {
        database = xxx
        password = "xxx"
        host = "127.0.0.1:8123"
        table = xxx
        username = xxx
    }
}

Running Command

nohup ./bin/seatunnel.sh -c ./config/ck_ck.conf &> ./logs/ck_ck.log &

Error Exception

Exception in thread "main" org.apache.seatunnel.core.starter.exception.CommandExecuteException: SeaTunnel job executed failed
	at org.apache.seatunnel.core.starter.seatunnel.command.ClientExecuteCommand.execute(ClientExecuteCommand.java:181)
	at org.apache.seatunnel.core.starter.SeaTunnel.run(SeaTunnel.java:40)
	at org.apache.seatunnel.core.starter.seatunnel.SeaTunnelClient.main(SeaTunnelClient.java:34)
Caused by: java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Metaspace
	at com.hazelcast.spi.impl.AbstractInvocationFuture.wrapInCompletionException(AbstractInvocationFuture.java:1347)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.cascadeException(AbstractInvocationFuture.java:1340)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.access$200(AbstractInvocationFuture.java:65)
	at com.hazelcast.spi.impl.AbstractInvocationFuture$ApplyNode.execute(AbstractInvocationFuture.java:1478)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.unblockOtherNode(AbstractInvocationFuture.java:797)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.unblockAll(AbstractInvocationFuture.java:759)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.complete0(AbstractInvocationFuture.java:1235)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.completeExceptionallyInternal(AbstractInvocationFuture.java:1223)
	at com.hazelcast.spi.impl.AbstractInvocationFuture.completeExceptionally(AbstractInvocationFuture.java:709)
	at com.hazelcast.client.impl.spi.impl.ClientInvocation.completeExceptionally(ClientInvocation.java:294)
	at com.hazelcast.client.impl.spi.impl.ClientInvocation.notifyExceptionWithOwnedPermission(ClientInvocation.java:321)
	at com.hazelcast.client.impl.spi.impl.ClientInvocation.notifyException(ClientInvocation.java:304)
	at com.hazelcast.client.impl.spi.impl.ClientResponseHandlerSupplier.handleResponse(ClientResponseHandlerSupplier.java:164)
	at com.hazelcast.client.impl.spi.impl.ClientResponseHandlerSupplier.process(ClientResponseHandlerSupplier.java:141)
	at com.hazelcast.client.impl.spi.impl.ClientResponseHandlerSupplier.access$300(ClientResponseHandlerSupplier.java:60)
	at com.hazelcast.client.impl.spi.impl.ClientResponseHandlerSupplier$DynamicResponseHandler.accept(ClientResponseHandlerSupplier.java:251)
	at com.hazelcast.client.impl.spi.impl.ClientResponseHandlerSupplier$DynamicResponseHandler.accept(ClientResponseHandlerSupplier.java:243)
	at com.hazelcast.client.impl.connection.tcp.TcpClientConnection.handleClientMessage(TcpClientConnection.java:245)
	at com.hazelcast.client.impl.protocol.util.ClientMessageDecoder.handleMessage(ClientMessageDecoder.java:135)
	at com.hazelcast.client.impl.protocol.util.ClientMessageDecoder.onRead(ClientMessageDecoder.java:89)
	at com.hazelcast.internal.networking.nio.NioInboundPipeline.process(NioInboundPipeline.java:136)
	at com.hazelcast.internal.networking.nio.NioThread.processSelectionKey(NioThread.java:383)
	at com.hazelcast.internal.networking.nio.NioThread.processSelectionKeys(NioThread.java:368)
	at com.hazelcast.internal.networking.nio.NioThread.selectLoop(NioThread.java:294)
	at com.hazelcast.internal.networking.nio.NioThread.executeRun(NioThread.java:249)
	at com.hazelcast.internal.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:102)
Caused by: java.lang.OutOfMemoryError: Metaspace

Flink or Spark Version

none

Java or Scala Version

1.8

Screenshots

(screenshot attached)

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wu-a-ge (Contributor) commented Jun 25, 2023

I also encountered this problem. Is there a simple solution for it? @chaorongzhi

@chaorongzhi (Contributor, Author)

I also encountered this problem. Is there a simple solution for it? @chaorongzhi

Sorry, I do not have a solution at the moment; I am trying to solve it.

@chaorongzhi (Contributor, Author)

Hi @liugddx, I have reproduced the bug, but I do not know how to solve it yet. Can you give me some advice?

@chaorongzhi (Contributor, Author)

Hi @liugddx, I have reproduced the bug, but I do not know how to solve it yet. Can you give me some advice?

I got the dump file and am trying to analyze it.

@liugddx (Member) commented Jun 29, 2023

Hi @liugddx, I have reproduced the bug, but I do not know how to solve it yet. Can you give me some advice?

Maybe you can adjust the JVM parameters.

@wu-a-ge (Contributor) commented Jun 29, 2023

@chaorongzhi It may be a bug in the Zeta engine. I found that its history service never deletes finished jobs!

@liugddx (Member) commented Jun 29, 2023

@chaorongzhi It may be a bug in the Zeta engine. I found that its history service never deletes finished jobs!

You're right. Have you been running a lot of jobs?

@chaorongzhi (Contributor, Author)

@chaorongzhi It may be a bug in the Zeta engine. I found that its history service never deletes finished jobs!

You're right. Have you been running a lot of jobs?

Yes, I run 11 batch tasks per minute. The metaspace size does not drop after a full GC, and classes are rarely unloaded.
(screenshot attached)

@chaorongzhi (Contributor, Author)

@chaorongzhi It may be a bug in the Zeta engine. I found that its history service never deletes finished jobs!

You're right. Have you been running a lot of jobs?

Can this be solved by caching the SeaTunnelChildFirstClassLoader instead of re-creating a SeaTunnelChildFirstClassLoader instance each time?
(screenshot attached)
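For illustration only, a minimal sketch of this caching idea, not the actual SeaTunnel implementation: keep one loader per distinct jar set and reuse it across jobs. URLClassLoader stands in for SeaTunnelChildFirstClassLoader so the example stays self-contained, and all names here are hypothetical.

    import java.net.URL;
    import java.net.URLClassLoader;
    import java.util.Collection;
    import java.util.Map;
    import java.util.TreeSet;
    import java.util.concurrent.ConcurrentHashMap;

    public class CachedClassLoaderFactory {

        // One loader per distinct jar set, shared by every job that uses those jars.
        private final Map<String, URLClassLoader> cache = new ConcurrentHashMap<>();

        public URLClassLoader getOrCreate(Collection<URL> jars) {
            // Sort the jar URLs so the same set of jars always produces the same cache key.
            TreeSet<String> sorted = new TreeSet<>();
            for (URL jar : jars) {
                sorted.add(jar.toString());
            }
            String key = String.join(";", sorted);
            // Create the loader only on the first request for this jar set; reuse it afterwards,
            // so metaspace is not filled with duplicate copies of the connector classes.
            return cache.computeIfAbsent(key, k -> new URLClassLoader(jars.toArray(new URL[0])));
        }
    }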

@liugddx (Member) commented Jun 30, 2023

SeaTunnelChildFirstClassLoader

It should be possible. Can you submit a PR to fix this problem?

@chaorongzhi (Contributor, Author)

SeaTunnelChildFirstClassLoader

It should be possible. Can you submit a PR to fix this problem?

Sure, I'll try.

@liugddx liugddx added the Zeta label Jun 30, 2023
@wu-a-ge (Contributor) commented Jun 30, 2023

@chaorongzhi It may be a bug in the Zeta engine. I found that its history service never deletes finished jobs!

You're right. Have you been running a lot of jobs?

I have used it in production. I have been investigating the metaspace overflow problem; I found some memory-management problems in the Zeta engine, and I am trying to fix them.

@wu-a-ge (Contributor) commented Jun 30, 2023

@chaorongzhi It may be a bug in the Zeta engine. I found that its history service never deletes finished jobs!

You're right. Have you been running a lot of jobs?

Can this be solved by caching the SeaTunnelChildFirstClassLoader instead of re-creating a SeaTunnelChildFirstClassLoader instance each time? (screenshot attached)

I removed the SeaTunnelChildFirstClassLoader class and put all the plugins in the lib directory; metaspace still rose, just more slowly. I have also already fixed the problem of JobHistoryService caching data for too long. But it still feels like the metaspace inflation is not solved. I now suspect the Zeta engine's use of Hazelcast serialization and deserialization: it could be that the Zeta engine is misusing it, or it could be Hazelcast itself.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

@github-actions github-actions bot added the stale label Jul 31, 2023
@Bingz2 (Contributor) commented Aug 23, 2023

I encountered the same problem with version 2.3.2. May I ask whether this was fixed in version 2.3.3?

@wu-a-ge (Contributor) commented Sep 15, 2023

@chaorongzhi Is there any progress on this issue?

@wu-a-ge (Contributor) commented Jan 1, 2024 via email

@liugddx (Member) commented Jan 1, 2024

Use G1GC.
(screenshots attached)
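For reference, a minimal sketch of what G1-related settings in config/jvm_options could look like. The exact values in the screenshots above are not reproduced here, so the sizes below are illustrative assumptions to tune per deployment:

    -Xms4g
    -Xmx4g
    -XX:MaxMetaspaceSize=2g
    -XX:+UseG1GC
    -XX:+HeapDumpOnOutOfMemoryError

Note that MaxMetaspaceSize only caps metaspace; classes are actually freed only when their classloader becomes unreachable and is collected, which is why the classloader reuse discussed above matters more than the collector choice.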

@wangzhiwei61

Excuse me, can I just configure G1 in jvm_options in the 2.3.3 release? @liugddx

@wangzhiwei61

1. Version: 2.3.3, JDK 1.8
2. Config: G1 configured (screenshot attached)

Result: (screenshot attached)

@liugddx What should I do? Can you give me some advice?

@gguo commented Jan 10, 2024

@wu-a-ge @chaorongzhi Hi, I was running into the same issue with 2.3.3. I also tried to remove the SeaTunnelChildFirstClassLoader class, but it does not seem to work. Could you please give me some instructions to solve the issue?

@Hisoka-X (Member) commented Mar 4, 2024

After checking the log, the "classloader created" log entry does not exist.
(screenshot attached)
That means no new classloader is being created on the server side.

Could you provide your full logs from after the server started? There should be more than one log file. Or start the server in debug mode. @W-dragan

@wu-a-ge (Contributor) commented Mar 7, 2024

@W-dragan Didn't your earlier tests show the memory overflow problem was gone? How did it come back? We are counting on you!
@Hisoka-X @liugddx Versions 2.3.2 through 2.3.4 have not solved this memory overflow problem. If this stability issue is not resolved, it will drive many people away.

@liugddx liugddx reopened this Mar 7, 2024
@Hisoka-X (Member) commented Mar 7, 2024

Hi @wu-a-ge. We tried to fix it with PR #6355. It is not included in 2.3.4, only in dev; we will release it in the next version. You can try the dev branch to test the new behavior. It would help a lot if you can.

@W-dragan commented Mar 7, 2024

metaSpace.txt
I currently do not have time to verify and provide the logs of the earlier batch tasks; I will provide them later. However, today I discovered a new metaspace issue. The configuration I provided is 3c6G, Xms3g, and Xmx3g.

Running 22 CDC tasks simultaneously also produces java.lang.OutOfMemoryError: Metaspace.

The configuration is very simple; it should be easy to reproduce, simply MySQL-CDC to PostgreSQL.
Due to certain reasons, we can only provide partial logs.
@Hisoka-X

@Hisoka-X (Member) commented Mar 7, 2024

Due to certain reasons, we can only provide partial logs.

Maybe you can do some data desensitization.

Running 22 CDC tasks simultaneously also produces java.lang.OutOfMemoryError: Metaspace.

That is a lot of tasks for a 3g heap. Do the 22 CDC jobs use the same source and sink? What is the parallelism value?

@W-dragan commented Mar 7, 2024

@Hisoka-X Yes, it is the same source and sink, but divided into 22 jobs, each with a parallelism of 1: source mysqltable1 to sink pgtable1, ..., source mysqltable22 to sink pgtable22.

@Hisoka-X (Member) commented Mar 7, 2024

@Hisoka-X Yes, it is the same source and sink, but divided into 22 jobs, each with a parallelism of 1: source mysqltable1 to sink pgtable1, ..., source mysqltable22 to sink pgtable22.

Please provide us with complete desensitized logs with cache mode turned on. Thanks.

@W-dragan commented Mar 7, 2024

classloader.txt

org.apache.seatunnel.engine.server.service.classloader.DefaultClassLoaderService#getClassLoader
    @Override
    public synchronized ClassLoader getClassLoader(long jobId, Collection<URL> jars) {
        log.info("Get classloader for job {} with jars {}", jobId, jars);
        if (cacheMode) {
            // with cache mode, all jobs share the same classloader if the jars are the same
            jobId = 1L;
        }
        if (!classLoaderCache.containsKey(jobId)) {
            classLoaderCache.put(jobId, new ConcurrentHashMap<>());
            classLoaderReferenceCount.put(jobId, new ConcurrentHashMap<>());
        }
        Map<String, ClassLoader> classLoaderMap = classLoaderCache.get(jobId);
        String key = covertJarsToKey(jars);
        if (classLoaderMap.containsKey(key)) {
            log.info("use exist classloader for job {} with jars {}", jobId, jars);
            classLoaderReferenceCount.get(jobId).get(key).incrementAndGet();
            return classLoaderMap.get(key);
        } else {
            ClassLoader classLoader = new SeaTunnelChildFirstClassLoader(jars);
            log.info("Create classloader for job {} with jars {}", jobId, jars);
            classLoaderMap.put(key, classLoader);
            classLoaderReferenceCount.get(jobId).put(key, new AtomicInteger(1));
            return classLoader;
        }
    }
This is part of the code I modified, mainly changing the log level and adding logs so I could analyze which branch the code takes. Based on the logs above, the classloader cache mode was indeed enabled and the branch logic was correct. However, the number of classloaders still kept growing.

2024-03-07 17:00:35,684 INFO org.apache.seatunnel.engine.common.loader.ClassLoaderUtil - recycle classloader org.apache.seatunnel.engine.common.loader.SeaTunnelChildFirstClassLoader@7dc18237

So it does look like this classloader is being recycled.

@Hisoka-X

@Hisoka-X (Member) commented Mar 7, 2024

@W-dragan How do you submit jobs, HTTP or shell?

@W-dragan commented Mar 7, 2024

@Hisoka-X HTTP

@Hisoka-X (Member) commented Mar 7, 2024

Oh, I see. This is a bug in HTTP job submission. That code path creates one new ClassLoader every time a job is submitted. In shell mode this code executes on the client side, so the classloader is closed when the client shuts down. But in HTTP mode it executes on the master server side, so it is never recycled at the moment. You can use the shell to submit jobs to avoid this bug. Also, I will fix it this week. Thanks for your help, @W-dragan!

@Hisoka-X (Member) commented Mar 7, 2024

Oh, I see. This is a bug in HTTP job submission. That code path creates one new ClassLoader every time a job is submitted. In shell mode this code executes on the client side, so the classloader is closed when the client shuts down. But in HTTP mode it executes on the master server side, so it is never recycled at the moment. You can use the shell to submit jobs to avoid this bug. Also, I will fix it this week. Thanks for your help, @W-dragan!

cc @liugddx
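As an illustration of the recycling side described above (hypothetical names, not the code from #6477 or #6492): a cached loader can be reference-counted and closed only when the last job using it finishes, which is what makes class unloading from metaspace possible.

    import java.io.IOException;
    import java.net.URLClassLoader;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ClassLoaderReleaser {

        private final Map<String, URLClassLoader> cache = new ConcurrentHashMap<>();
        private final Map<String, AtomicInteger> refCounts = new ConcurrentHashMap<>();

        // Called when a job that borrowed the loader for this jar-set key finishes.
        public synchronized void release(String key) throws IOException {
            AtomicInteger count = refCounts.get(key);
            if (count == null) {
                return; // nothing cached under this key
            }
            if (count.decrementAndGet() <= 0) {
                // Last user is gone: drop the loader from the cache and close it so its
                // classes become eligible for unloading from metaspace at the next GC.
                refCounts.remove(key);
                URLClassLoader loader = cache.remove(key);
                if (loader != null) {
                    loader.close();
                }
            }
        }
    }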

@W-dragan commented Mar 7, 2024

For framework-design reasons, our current unified HTTP mode cannot be switched to shell mode, and it is unlikely we will use shell mode in the future. If you fix it, please kindly @ me. Thank you very much.
@Hisoka-X @liugddx

@Hisoka-X Hisoka-X self-assigned this Mar 7, 2024
@liugddx (Member) commented Mar 8, 2024

For framework-design reasons, our current unified HTTP mode cannot be switched to shell mode, and it is unlikely we will use shell mode in the future. If you fix it, please kindly @ me. Thank you very much. @Hisoka-X @liugddx

Let me see~

@Hisoka-X (Member)

Hi @W-dragan, could you try again with #6477?

@W-dragan

errorlog.txt
I used the code from #6477 to submit a job over HTTP in cluster mode, but encountered a null pointer exception.

It seems the node handling the request is not the master node, and there is no subsequent master-node check, which causes the problem.
@Hisoka-X

@liugddx (Member) commented Mar 12, 2024

errorlog.txt I used the code from #6477 to submit a job over HTTP in cluster mode, but encountered a null pointer exception.

It seems the node handling the request is not the master node, and there is no subsequent master-node check, which causes the problem.
@Hisoka-X

My fault, I didn't check whether seaTunnelServer is null.

@Hisoka-X (Member)

errorlog.txt I used the code from #6477 to submit a job over HTTP in cluster mode, but encountered a null pointer exception.

It seems the node handling the request is not the master node, and there is no subsequent master-node check, which causes the problem.
@Hisoka-X

My fault, I didn't check whether seaTunnelServer is null.

All nodes have a SeaTunnel server; the only difference is whether a node is the master or not.

@W-dragan

errorlog.txt I used the code from #6477 to submit a job over HTTP in cluster mode, but encountered a null pointer exception.

It seems the node handling the request is not the master node, and there is no subsequent master-node check, which causes the problem.
@Hisoka-X

My fault, I didn't check whether seaTunnelServer is null.

All nodes have a SeaTunnel server; the only difference is whether a node is the master or not.

But in this method, if it is not a master node it returns null, and I remember that was written intentionally to solve #6217.

@liugddx (Member) commented Mar 12, 2024

errorlog.txt I used the code from #6477 to submit a job over HTTP in cluster mode, but encountered a null pointer exception.

It seems the node handling the request is not the master node, and there is no subsequent master-node check, which causes the problem.
@Hisoka-X

My fault, I didn't check whether seaTunnelServer is null.

All nodes have a SeaTunnel server; the only difference is whether a node is the master or not.

But in this method, if it is not a master node it returns null, and I remember that was written intentionally to solve #6217.

Please test #6492.

@W-dragan

    JobImmutableInformation jobImmutableInformation = restJobExecutionEnvironment.build();
    Long jobId = jobImmutableInformation.getJobId();
    if (seaTunnelServer == null) {
        NodeEngineUtil.sendOperationToMasterNode(

I made the changes according to #6492, but I think in that case the line of code I marked should also be changed, and similar logic in the RestHttpGetCommandProcessor class may also need to be changed for consistency.

@liugddx (Member) commented Mar 12, 2024

RestHttpGetCommandProcessor

I unified the method; please check again.

@W-dragan commented Mar 12, 2024

After further verification, I found a phenomenon that may be related to cluster deployment. I repeatedly submitted two kinds of jobs over HTTP, one batch and one CDC. In the end all the jobs completed, and the CDC job was terminated with stop. Theoretically there should be only two classloaders, and they should be released when the jobs complete. However, after all the jobs ended there were still three instances of SeaTunnelChildFirstClassLoader. Of course, even with repeated submissions no new SeaTunnelChildFirstClassLoader instances are created, which should solve the OOM problem to some extent, but this phenomenon still confuses me.
@Hisoka-X @liugddx

@Hisoka-X (Member)

Theoretically there should be only two classloaders, and they should be released when the jobs complete.

No, one job does not map to only one classloader at the moment. It can involve a source classloader, a sink classloader, and a combined source-and-sink classloader.

However, after all the jobs ended there were still three instances of SeaTunnelChildFirstClassLoader.

Yes. In cache mode we keep classloaders in memory so we can reuse them for future jobs.
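To make the "three instances" concrete: in cache mode each distinct jar set gets its own cached loader, so a single source-to-sink job can legitimately leave up to three entries behind (source jars, sink jars, and the combined set). A tiny hypothetical illustration of that key derivation, with invented jar names:

    import java.util.Arrays;
    import java.util.List;
    import java.util.TreeSet;

    public class CacheKeyExample {

        // Order-independent key, in the spirit of covertJarsToKey in the snippet above.
        static String keyOf(List<String> jars) {
            return String.join(";", new TreeSet<>(jars));
        }

        public static void main(String[] args) {
            String sourceKey = keyOf(Arrays.asList("connector-cdc-mysql.jar"));
            String sinkKey = keyOf(Arrays.asList("connector-jdbc.jar"));
            String bothKey = keyOf(Arrays.asList("connector-cdc-mysql.jar", "connector-jdbc.jar"));
            // Three different keys -> three cached classloader instances, even for one logical job.
            System.out.println(sourceKey + "\n" + sinkKey + "\n" + bothKey);
        }
    }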
