
Defunct git processes when running Gitea as PID 1 on docker #19077

Closed
jswolf19 opened this issue Mar 14, 2022 · 39 comments · Fixed by #19865

Comments

@jswolf19

jswolf19 commented Mar 14, 2022

Gitea Version

1.15.7

Git Version

2.34.1

Operating System

OpenSuse Leap 15.3 (Docker)

How are you running Gitea?

I used an rpm built using the repo at https://build.opensuse.org/package/show/devel:tools:scm/gitea and installed with zypper.
I'm running in a docker container based on OpenSuse Leap 15.3, base os also OpenSuse Leap 15.3

Database

MySQL

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Description

I have zombie processes whose start times coincide with log entries such as the one below.
After updating my GitHub login credentials, the zombie processes no longer seem to occur.
#13987 is documented as being fixed by #14006 in the 1.14.0 milestone, so I'm reporting this as a possible regression. I'm sorry I haven't been able to upgrade to the latest version to check whether the issue still exists. If you need any additional information, please let me know.

2022/03/14 15:22:27 ...irror/mirror_pull.go:176:runSync() [E] Failed to update mirror repository &{21 2 Towa <nil> purchasevisualization PurchaseVisualization 購入品管理  2 https://:[email protected]/Towa-Japan/PurchaseVisualization.git main 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 false false false true <nil> [] 0 map[] map[] [] <nil> false 0 <nil> false 0 <nil> 103641 <nil> <nil> false false [] default  1610340108 1647238881}:
        Stdout: Fetching origin

        Stderr: remote: Repository not found.
        fatal: Authentication failed for 'https://github.com/Towa-Japan/PurchaseVisualization.git/'
        error: Could not fetch origin

        Err: exit status 1

Screenshots

No response

@lunny
Member

lunny commented Mar 14, 2022

Please upgrade to v1.16.3

@jswolf19
Author

Please upgrade to v1.16.3

I hope to have the time soon. Does that mean that this is not an issue in 1.16.3?

@lunny
Member

lunny commented Mar 14, 2022

If I'm not wrong, it should be.

@kiriharu

kiriharu commented Apr 5, 2022

Same issue on 1.16.2 in Docker

@lunny
Member

lunny commented Apr 5, 2022

Same issue on 1.16.2 in Docker

Please upgrade to v1.16.5

@jswolf19
Author

I'm assuming I'll just be told to upgrade, but for what it's worth, the issue persists in v1.16.3. I'm also using docker, if that makes a difference.

@silentcodeg
Contributor

If I'm not wrong, it should be.

Could you please share why you think upgrading to 1.16.5 fixes the problem?

@zeripath
Contributor

zeripath commented May 31, 2022

The below was written under a misunderstanding, when the title said zombie processes rather than defunct. #13987 related to git processes that were stuck open, whereas this case relates to processes that are dead but have not been reaped.

This should be finally fixed by #19454 and its backport #19466 which is in 1.16.7

Fixing this was extremely difficult before #19207 was merged because we were never provided with a simple reproducing case, logs or pprof data to figure out what is really happening.

@jswolf19
Author

@zeripath thanks for the info. I think I'm on 1.16.8 now, so when I get the time, I'll verify.

@jswolf19
Author

jswolf19 commented Jun 1, 2022

@zeripath The zombie (defunct) processes still get created at a roughly 10-20 minute interval with 1.16.8.

we were never provided with a simple reproducing case, logs or pprof data to figure out what is really happening.

If there's something I can provide, then I'm more than willing to, but I'm not sure what I can give you that will help, or how to get it. It's also not a major issue for me, because it only happens in error cases and I've just found out that running docker with the --init flag seems to at least clean up the zombies. In fact, it's entirely possible that the zombie processes are due to how git works, in which case using the --init flag may be the only solution to stop them.

However, the mirror's settings page reports the latest update date as the date of the last failed sync job, so it is not obvious that an error is occurring without comparing the two repos or looking at the Gitea logs for the error.

As best I can tell, the following steps should reproduce the issue (assuming a gitea instance running in a docker container):

  1. create (or have access to) a private repo on github.
  2. create a personal access token
  3. create a mirror in gitea using the token created above.
  4. regenerate the personal access token created above.
  5. wait for the mirror sync timeout to expire.

After the above steps, every time the mirror update job runs in gitea, a zombie git process results (sample ps output)

~ > ps -ax -O lstart | grep defunct
 3204 Wed Jun  1 17:15:11 2022 Z ?        00:00:00 git <defunct>
 3618 Wed Jun  1 16:25:11 2022 Z ?        00:00:00 git <defunct>
 9033 Wed Jun  1 16:35:11 2022 Z ?        00:00:00 git <defunct>
 9666 Wed Jun  1 09:56:35 2022 Z ?        00:00:00 git <defunct>
10932 Wed Jun  1 17:25:11 2022 Z ?        00:00:00 git <defunct>
15320 Wed Jun  1 17:35:11 2022 Z ?        00:00:00 git <defunct>
15877 Wed Jun  1 16:45:11 2022 Z ?        00:00:00 git <defunct>
20352 Wed Jun  1 17:45:11 2022 Z ?        00:00:00 git <defunct>
21406 Wed Jun  1 16:05:11 2022 Z ?        00:00:00 git <defunct>
22423 Wed Jun  1 17:55:11 2022 Z ?        00:00:00 git <defunct>
23985 Wed Jun  1 16:55:11 2022 Z ?        00:00:00 git <defunct>

and an error similar to below is generated in the log file

2022/06/01 16:55:13 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 60:jon_s/TowaIoT]: failed to update mirror  repository:  <-- pid 23985
        Stdout: Fetching origin

        Stderr: remote: Invalid username or password.
        fatal: Authentication failed for 'https://github.com/Towa-Japan/TowaIoT.git/'
        error: could not fetch origin

        Err: exit status 1
2022/06/01 17:05:13 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 60:jon_s/TowaIoT]: failed to update mirror  repository: <-- no zombie?
        Stdout: Fetching origin

        Stderr: remote: Invalid username or password.
        fatal: Authentication failed for 'https://github.com/Towa-Japan/TowaIoT.git/'
        error: could not fetch origin

        Err: exit status 1
2022/06/01 17:15:13 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 60:jon_s/TowaIoT]: failed to update mirror repository: <-- pid 3204
        Stdout: Fetching origin

        Stderr: remote: Invalid username or password.
        fatal: Authentication failed for 'https://github.com/Towa-Japan/TowaIoT.git/'
        error: could not fetch origin

        Err: exit status 1

@zeripath
Contributor

zeripath commented Jun 1, 2022

Ah, so you think that these defunct processes are related to SyncMirrors? I suspect that these defunct processes are actually caused by child processes that git itself creates.

Does this happen on 1.17 too? What is your version of Git?

(I wonder if it's possibly related to a version of Git?)

I have one further possible solution - we could try to set a process attribute Setpgid: true

Could you apply the below patch?

diff --git a/modules/git/command.go b/modules/git/command.go
index 3dd12e421..b3266ce99 100644
--- a/modules/git/command.go
+++ b/modules/git/command.go
@@ -13,6 +13,7 @@ import (
 	"os"
 	"os/exec"
 	"strings"
+	"syscall"
 	"time"
 	"unsafe"
 
@@ -157,6 +158,7 @@ func (c *Command) Run(opts *RunOpts) error {
 		"GIT_NO_REPLACE_OBJECTS=1",
 	)
 
+	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
 	cmd.Dir = opts.Dir
 	cmd.Stdout = opts.Stdout
 	cmd.Stderr = opts.Stderr

and see if that prevents these defunct processes?

@singuliere singuliere reopened this Jun 1, 2022
@singuliere
Contributor

singuliere commented Jun 1, 2022

It would help to know if the zombie processes are orphans or if their parent is the Gitea process. Running something like ps -o pid,ppid,comm,args within the container will show the PPID (parent process id) as well as the PID.

If the git process is indeed an orphan, setting the process group ID as @zeripath suggests will help. The Gitea child then becomes a process group leader, and killing it will also kill all of its children instead of leaving them behind.

The system call setpgid is used to set the process group ID of a process, thereby either joining the process to an existing process group, or creating a new process group within the session of the process with the process becoming the process group leader of the newly created group.[5] POSIX prohibits the re-use of a process ID where a process group with that identifier still exists (i.e. where the leader of a process group has exited, but other processes in the group still exist). It thereby guarantees that processes may not accidentally become process group leaders.

That being said, I'd be surprised if that is the cause. A process is a zombie when its parent fails to wait(2). And it is more likely that said process is Gitea itself.

The reproducer you provide is a good one 👍 and I'll give it a try.
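
Here is a minimal sketch of that failure mode (plain Go, not Gitea code): a parent that starts a child but never calls Wait() leaves the child <defunct> until the parent reaps it or exits.

package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Start a child that exits immediately...
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	fmt.Printf("started child pid %d, deliberately not calling Wait()\n", cmd.Process.Pid)

	// ...and never reap it. While this program sleeps, `ps -o pid,ppid,stat,comm`
	// from another shell shows the child in state Z (<defunct>). Calling cmd.Wait()
	// (or exiting, so that init adopts and reaps the orphan) makes it disappear.
	time.Sleep(60 * time.Second)
}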

@zeripath
Contributor

zeripath commented Jun 1, 2022

That being said, I'd be surprised if that is the cause. A process is a zombie when its parent fails to wait(2). And it is more likely that said process is Gitea itself.

How exactly do you think it could happen that Gitea doesn't wait?

if err := cmd.Start(); err != nil {
	return err
}
if opts.PipelineFunc != nil {
	err := opts.PipelineFunc(ctx, cancel)
	if err != nil {
		cancel()
		_ = cmd.Wait()
		return err
	}
}
if err := cmd.Wait(); err != nil && ctx.Err() != context.DeadlineExceeded {

The only way that something doesn't wait() here is if there is a panic in PipelineFunc, but there's no PipelineFunc here, and a panic would be logged.

Looking deeper at cmd.Wait()

From os/exec/exec.go:

func (c *Cmd) Wait() error {
	if c.Process == nil {
		return errors.New("exec: not started")
	}
	if c.finished {
		return errors.New("exec: Wait was already called")
	}
	c.finished = true

	state, err := c.Process.Wait()
	if c.waitDone != nil {
		close(c.waitDone)
	}
	c.ProcessState = state

	var copyError error
	for range c.goroutine {
		if err := <-c.errch; err != nil && copyError == nil {
			copyError = err
		}
	}

	c.closeDescriptors(c.closeAfterWait)

	if err != nil {
		return err
	} else if !state.Success() {
		return &ExitError{ProcessState: state}
	}

	return copyError
}

...

// Wait waits for the Process to exit, and then returns a
// ProcessState describing its status and an error, if any.
// Wait releases any resources associated with the Process.
// On most operating systems, the Process must be a child
// of the current process or an error will be returned.
func (p *Process) Wait() (*ProcessState, error) {
	return p.wait()
}

and os/exec/exec_unix.go:

func (p *Process) wait() (ps *ProcessState, err error) {
	if p.Pid == -1 {
		return nil, syscall.EINVAL
	}

	// If we can block until Wait4 will succeed immediately, do so.
	ready, err := p.blockUntilWaitable()
	if err != nil {
		return nil, err
	}
	if ready {
		// Mark the process done now, before the call to Wait4,
		// so that Process.signal will not send a signal.
		p.setDone()
		// Acquire a write lock on sigMu to wait for any
		// active call to the signal method to complete.
		p.sigMu.Lock()
		p.sigMu.Unlock()
	}

	var (
		status syscall.WaitStatus
		rusage syscall.Rusage
		pid1   int
		e      error
	)
	for {
		pid1, e = syscall.Wait4(p.Pid, &status, 0, &rusage)
		if e != syscall.EINTR {
			break
		}
	}
	if e != nil {
		return nil, NewSyscallError("wait", e)
	}
	if pid1 != 0 {
		p.setDone()
	}
	ps = &ProcessState{
		pid:    pid1,
		status: status,
		rusage: &rusage,
	}
	return ps, nil
}

Gitea is waiting for the process... (assuming the goroutine isn't blocked in cmd.Wait() which we know it isn't here because we've got logs.)

@singuliere
Contributor

How exactly do you think it could happen that Gitea doesn't wait?

I have no clue. I'm betting on a bug as a more likely alternative to git spawning a child, but it's just a wild guess. I have the reproducer in place.

To be continued :-)

@zeripath
Contributor

zeripath commented Jun 1, 2022

I guess an alternative would be that this isn't related to SyncMirrors at all and the defunct git processes are coming from something else - but what? The processes tab in /admin should show a stuck process if so - but they haven't reported this.

If they are related to the SyncMirrors call though then I really think that these have to be child processes from the git call itself. You can see that we're reading the Stdout, the Stderr and the ExitStatus from the git remote update call. Thus we have to be waiting for that process to finish.

@singuliere
Contributor

singuliere commented Jun 1, 2022

@jswolf19 I'm unable to reproduce the problem with the steps you listed. For the record, here is what I did:

  • docker run --name gitea -p 8080:3000 -e GITEA__security__INSTALL_LOCK=true -e GITEA__mirror__MIN_INTERVAL=2m -d gitea/gitea:1.16.8
  • docker exec --user 1000 gitea gitea admin user create --admin --username root --password admin1234 --email [email protected]
  • create (or have access to) a private repo on github.
  • create a personal access token
  • create a mirror in gitea using the token created above.
  • deleted the personal access token created above.
  • docker exec gitea watch ps -o pid,ppid,comm,args
  • wait for the mirror sync timeout to expire (10 minutes)
  • forced the mirror to run from the admin panel
  • forced the mirror to update from the repository settings
  • watch the logs with docker logs -f gitea and see the following when forcing the mirror from the settings (nothing when the mirror fails from the scheduled mirror task, which is what I expect):
2022/06/01 12:19:27 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 1:root/private1]: failed to update mirror repository:
	Stdout: Fetching origin
	
	Stderr: remote: Invalid username or password.
	fatal: Authentication failed for 'https://github.com/singuliere/private1.git/'
	error: Could not fetch origin
	
	Err: exit status 1

@zeripath I did not configure the verbose logs just yet but I will as soon as I'm able to reproduce the problem, unless @jswolf19 does it first ;-)

@jswolf19 would you be so kind as to let me know what I do differently from what you do? That will probably explain why I don't see the problem and you do. I'll let the container run for a few hours but I don't see any rational reason why it would start showing zombies now.

@zeripath
Contributor

zeripath commented Jun 1, 2022

The reported SyncMirrors call here calls git remote update [--prune] <remote_name>.

Looking at the code for builtin/remote.c in git, we can see that it calls git fetch [--prune|--no-prune] --multiple <remote_name>.

If the parent git remote update is killed, I think it is possible that git could be creating this child in a way that would lead to a <defunct> process on systems that don't have a proper init or some automatic process grouping.
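
To make that mechanism concrete, here is a hedged sketch (plain Go, not Gitea code; sh and sleep stand in for git remote update and git fetch): killing only the direct child leaves the grandchild orphaned, and on a PID 1 that never reaps adopted children it ends up <defunct> once it exits.

package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// The "; true" forces the shell to fork `sleep` as a real grandchild
	// instead of exec'ing it in place.
	parent := exec.Command("sh", "-c", "sleep 300; true")
	if err := parent.Start(); err != nil {
		panic(err)
	}
	time.Sleep(time.Second) // give the shell time to fork `sleep`

	// Kill only the direct child, mimicking a cancelled `git remote update`
	// whose `git fetch` keeps running.
	_ = syscall.Kill(parent.Process.Pid, syscall.SIGKILL)
	_ = parent.Wait() // we reap our own child; the grandchild is not ours to reap

	fmt.Println("the orphaned `sleep 300` is re-parented to PID 1; check ps -o pid,ppid,stat,args")
	time.Sleep(60 * time.Second)
}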

@singuliere
Contributor

I stand corrected, your guess is better than mine 👍

@zeripath
Contributor

zeripath commented Jun 1, 2022

@zeripath I did not configure the verbose logs just yet but I will as soon as I'm able to reproduce the problem, unless @jswolf19 does it first ;-)

docker exec -u git <gitea-container> gitea manager logging add console -n traceconsole -l TRACE -e '((modules/git)|(services/mirror))'
docker logs -f <gitea-container>

@singuliere
Contributor

I did this:

$ docker exec -u git gitea gitea manager logging add console -n traceconsole -l TRACE -e '((modules/git)|(services/mirror))'
Added

Then clicked on the monitor to force the mirror task.

And looked at the logs with:

$ docker logs -n 10 gitea
2022/06/01 12:41:22 Completed POST /api/internal/manager/add-logger 200 OK in 161.111µs
2022/06/01 12:41:46 Started POST /admin for 172.17.0.1:40330
2022/06/01 12:41:46 ...rvices/cron/tasks.go:114:GetTask() [I] Getting update_mirrors in &{{0 0} update_mirrors 0xc0046a55c0 0x1e650e0 8}
2022/06/01 12:41:46 Completed POST /admin 302 Found in 2.680367ms
2022/06/01 12:41:46 ...ces/mirror/mirror.go:60:Update() [T] Doing: Update
2022/06/01 12:41:46 ...ces/mirror/mirror.go:141:Update() [T] Finished: Update: 0 pull mirrors and 0 push mirrors queued
2022/06/01 12:41:46 Started GET /admin/monitor for 172.17.0.1:40330
2022/06/01 12:41:46 Completed GET /admin/monitor 200 OK in 2.782788ms
2022/06/01 12:41:46 Started GET /assets/img/logo.svg for 172.17.0.1:40330
2022/06/01 12:41:46 Completed GET /assets/img/logo.svg 200 OK in 106.279µs

Which apparently means that there was no attempt to mirror. Shouldn't this trigger a mirror attempt?

@zeripath
Contributor

zeripath commented Jun 1, 2022

Which apparently means that there was no attempt to mirror. Shouldn't this trigger a mirror attempt?

Not if one is not due...

@singuliere
Contributor

Ah, this "Run" button is in case one is due but the opportunity was missed because of a restart or something? Is there a way to force a mirror task to run or should I just wait for the delay to expire?

@zeripath
Contributor

zeripath commented Jun 1, 2022

Go to the repository itself and click the synchronize now button

@singuliere
Contributor

That's in the mirror settings.

I need to set the delay to something shorter than 8h... I did try to force it from the settings and it does not create a zombie. I'm now verifying whether that holds true when the scheduled task does the same thing, which is a slightly different code path. I don't think it makes a difference, but I'm being thorough.

@singuliere
Contributor

singuliere commented Jun 1, 2022

Update: Hmm, maybe the conditions are not right in my test and sending the process to the background does something different. Let's scratch that.

@zeripath let's say Gitea runs git remote update, which runs git fetch, and then git remote update gets killed, leaving git fetch as an orphan. Assuming this all happens in the 1.16.8 docker image, the orphaned git fetch would be adopted by PID 1 and reaped once terminated. I verified this with:

$ docker exec gitea bash -c 'sleep 20 &'
$ docker exec gitea ps -o pid,ppid,comm,args
PID   PPID  COMMAND          COMMAND
    1     0 s6-svscan        /bin/s6-svscan /etc/s6
   16     1 s6-supervise     s6-supervise gitea
   17     1 s6-supervise     s6-supervise openssh
   18    16 gitea            /usr/local/bin/gitea web
   19    17 sshd             sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
 1682     1 sleep            sleep 20
 1683     0 ps               ps -o pid,ppid,comm,args
$ sleep 20 ; docker exec gitea ps -o pid,ppid,comm,args
PID   PPID  COMMAND          COMMAND
    1     0 s6-svscan        /bin/s6-svscan /etc/s6
   16     1 s6-supervise     s6-supervise gitea
   17     1 s6-supervise     s6-supervise openssh
   18    16 gitea            /usr/local/bin/gitea web
   19    17 sshd             sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
 1691     0 ps               ps -o pid,ppid,comm,args

So... either:

a) the zombie is not in the container and the environment is different from what we assume it is
b) the parent of the zombie process does not wait on it (which should be clarified with the output of ps)

@singuliere
Contributor

Here is an updated and better attempt at running the reproducer; no success though.

@jswolf19 I'm unable to reproduce the problem with the steps you listed. For the record, here is what I did:

  • Run and configure Gitea to update mirrors every 10 minutes:
$ docker run --name gitea -p 8080:3000 -e GITEA__security__INSTALL_LOCK=true -e GITEA__mirror__MIN_INTERVAL=10m -d gitea/gitea:1.16.8
$ docker exec --user 1000 gitea gitea admin user create --admin --username root --password admin1234 --email [email protected]
  • create (or have access to) a private repo on github.
  • create a personal access token
  • create a mirror in gitea using the token created above.
  • deleted the personal access token created above.
  • increase the debug log level with:
$ docker exec -u git gitea gitea manager logging add console -n traceconsole -l TRACE -e '((modules/git)|(services/mirror))'
  • docker exec gitea watch ps -o pid,ppid,comm,args
  • wait for the mirror sync timeout to expire (10 minutes)
  • watch the logs with docker logs -f gitea and see the following repeating every ten minutes.
2022/06/01 13:08:39 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 1:root/private1]: failed to update mirror repository:
	Stdout: Fetching origin
	
	Stderr: remote: Invalid username or password.
	fatal: Authentication failed for 'https://github.com/singuliere/private1.git/'
	error: Could not fetch origin
	
	Err: exit status 1
2022/06/01 13:18:38 ...ces/mirror/mirror.go:60:Update() [T] Doing: Update
2022/06/01 13:18:38 ...ces/mirror/mirror.go:141:Update() [T] Finished: Update: 1 pull mirrors and 0 push mirrors queued
2022/06/01 13:18:38 ...irror/mirror_pull.go:377:SyncPullMirror() [T] SyncMirrors [repo_id: 1]
2022/06/01 13:18:38 ...irror/mirror_pull.go:396:SyncPullMirror() [T] SyncMirrors [repo: 1:root/private1]: Running Sync
2022/06/01 13:18:38 ...irror/mirror_pull.go:202:runSync() [T] SyncMirrors [repo: 1:root/private1]: running git remote update...
2022/06/01 13:18:38 ...dules/git/command.go:146:RunWithContext() [D] /data/git/repositories/root/private1.git: /usr/bin/git -c credential.helper= -c protocol.version=2 -c uploadpack.allowfilter=true -c uploadpack.allowAnySHA1InWant=true remote get-url origin
2022/06/01 13:18:38 ...dules/git/command.go:246:RunInDirTimeoutEnv() [T] Stdout:
	 https://oauth2:[email protected]/singuliere/private1.git
	
2022/06/01 13:18:38 ...dules/git/command.go:146:RunWithContext() [D] /data/git/repositories/root/private1.git: /usr/bin/git -c credential.helper= -c protocol.version=2 -c uploadpack.allowfilter=true -c uploadpack.allowAnySHA1InWant=true remote update --prune origin
2022/06/01 13:18:39 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 1:root/private1]: failed to update mirror repository:
	Stdout: Fetching origin
	
	Stderr: remote: Invalid username or password.
	fatal: Authentication failed for 'https://github.com/singuliere/private1.git/'
	error: Could not fetch origin
	
	Err: exit status 1
2022/06/01 13:18:39 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 1:root/private1]: failed to update mirror repository:
	Stdout: Fetching origin
	
	Stderr: remote: Invalid username or password.
	fatal: Authentication failed for 'https://github.com/singuliere/private1.git/'
	error: Could not fetch origin
	
	Err: exit status 1
  • forced the mirror to run from the admin panel
  • forced the mirror to update from the repository settings
  • watch the logs with docker logs -f gitea and see the following when forcing the mirror from the settings (nothing when the mirror fails from the scheduled mirror task, which is what I expect):
2022/06/01 12:19:27 ...irror/mirror_pull.go:256:runSync() [E] SyncMirrors [repo: 1:root/private1]: failed to update mirror repository:
	Stdout: Fetching origin
	
	Stderr: remote: Invalid username or password.
	fatal: Authentication failed for 'https://github.com/singuliere/private1.git/'
	error: Could not fetch origin
	
	Err: exit status 1

@jswolf19 would you be so kind as to let me know what I do differently from what you do? That will probably explain why I don't see the problem and you do.

@zeripath
Contributor

zeripath commented Jun 1, 2022

Ah, but are you running docker with --init? @jswolf19 said:

It's also not a major issue for me, because it only happens in error cases and I've just found out that running docker with the --init flag seems to at least clean up the zombies. In fact, it's entirely possible that the zombie processes are due to how git works, in which case using the --init flag may be the only solution to stop them.

@singuliere
Contributor

I do not, but since it was mentioned as a way to get rid of zombies... I did not follow that path. I'm unable to see zombies. This is good, right?

🧟

I'll stop obsessing over this and wait for more information to get a reproducer working.

@jswolf19
Author

jswolf19 commented Jun 1, 2022

@zeripath @singuliere Sorry for not getting back sooner. I should have made it clearer, but I'm running a container from an image built from a custom Dockerfile. I'm pretty sure the exec process is gitea (when not using the --init flag), but I don't have the Dockerfile in front of me to make sure right now. @singuliere is using the gitea image from Docker Hub, which appears to use s6-svscan as its root process. Among other things, s6-svscan seems to do zombie reaping, which is why you shouldn't expect any zombies (passing the --init flag to docker run basically just injects a root process that runs a reaper and forwards signals to the entrypoint).

I don't have access to it right now, but if you'd like, tomorrow I should be able to provide a Dockerfile and run command that will work with the steps I provided earlier. If the fact that I'm not using the official docker image is enough that you don't want to pursue this issue further, I completely understand.

@singuliere
Contributor

@jswolf19 it's actually good news! It means the fix that @zeripath proposed (making Gitea's child processes process group leaders) will solve your problem. While your situation is certainly an unusual edge case, because most systems will automatically adopt and reap orphaned children, the fix will also address another problem (a minimal sketch follows the list below):

  • Gitea runs process A
  • Process A runs process B
  • Gitea kills process A
  • Process B keeps running forever
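
A minimal sketch of why making the child a process group leader helps (plain Go, not the actual Gitea code; sh and sleep stand in for the git processes): with Setpgid the child's pgid equals its pid, so one signal sent to the negative pid reaches the whole group, grandchildren included.

package main

import (
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sh", "-c", "sleep 300; true")
	// Put the child into its own process group; its pgid becomes its pid.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	time.Sleep(time.Second)

	// A negative pid addresses the whole process group, so both the shell
	// and its forked `sleep` grandchild receive the signal.
	_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	_ = cmd.Wait()
}

(The actual patch in #19865 only sets Setpgid on Gitea's git commands; the group-wide kill above is just to illustrate why group leadership matters.)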

zeripath added a commit to zeripath/gitea that referenced this issue Jun 1, 2022
When Gitea is running as PID 1 git will occasionally orphan child processes leading
to (defunct) processes. This PR simply sets Setpgid to true on these child processes
meaning that these defunct processes will also be correctly killed.

Fix go-gitea#19077

Signed-off-by: Andrew Thornton <[email protected]>
@zeripath zeripath changed the title Zombie git processes (likely related to #13987) Defunct git processes (likely related to #13987) Jun 1, 2022
@zeripath zeripath changed the title Defunct git processes (likely related to #13987) Defunct git processes Jun 1, 2022
@zeripath zeripath changed the title Defunct git processes Defunct git processes when running Gitea as PID 1 on docker Jun 1, 2022
@jswolf19
Author

jswolf19 commented Jun 1, 2022

@singuliere You may be able to reproduce using the rootless Dockerfile, as this uses gitea as the root process. If you want me to provide a Dockerfile, though, I can. My current setup is using a package maintained by @ecsgh, so the rootless file may be easier to use for debugging and testing if you can reproduce with it.

@zeripath My git version in the container is git 2.36.1-lp153.546.1 (OpenSuse Leap 15.3 package). I'm using a pre-built package, so patching will be difficult, as is. If you still want me to try out this patch, then I can, but it'll take me some time (probably sometime next week at the earliest).

@singuliere
Contributor

singuliere commented Jun 2, 2022

@jswolf19 I don't think the failed authentication causes the zombie because it goes like this:

  • Gitea runs git remote update
  • git remote update runs git fetch
  • git fetch fails and dies, git remote update waits on it and dies
  • Gitea waits on git remote update and all is clean

In order for the zombie to happen, I think it involves a mirror that times out for some reason and forces Gitea to kill the git remote update, which leaves the git fetch orphaned. It would be very useful if you could confirm this by looking at timeout events in the logs.
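
Here is a hedged sketch of that suspected sequence (plain Go, not Gitea code; sh and sleep stand in for git remote update and git fetch): the context timeout makes exec kill only the direct child, and the grandchild survives as an orphan.

package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// Stand-in for a per-mirror timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// "; true" forces the shell to fork `sleep` as a separate grandchild.
	cmd := exec.CommandContext(ctx, "sh", "-c", "sleep 300; true")
	if err := cmd.Run(); err != nil {
		// Only the shell is killed on timeout; `sleep 300` keeps running,
		// re-parented to PID 1, and will sit <defunct> there once it exits
		// if PID 1 never reaps it.
		fmt.Println("parent killed on timeout:", err)
	}
}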

@jswolf19
Author

jswolf19 commented Jun 2, 2022

In order for the zombie to happen, I think it involves a mirror that times out for some reason and forces Gitea to kill the git remote update, which leaves the git fetch orphaned. It would be very useful if you could confirm this by looking at timeout events in the logs.

@singuliere if you could help me know what to look for and where, I'll check. Would the timeout events be logged by gitea, or is there some other log I'd need to check? If they're logged by gitea, do I need to increase the log output in order to get the logs for the timeout events?

@singuliere
Contributor

@jswolf19 a detailed explanation, including a reproducer based on the rootless image as you suggested 👍, was just published on the Hostea blog: https://hostea.org/blog/zombies/

It all makes perfect sense now and @zeripath's patch is the definitive answer to this ✨

@jswolf19
Author

jswolf19 commented Jun 2, 2022

I appreciate the work you guys have put into looking into this and fixing it. I will be sure to verify once I've updated ^_^

@zeripath
Contributor

zeripath commented Jun 3, 2022

It appears that #19865 does fix this, at least by my reckoning.

zeripath added a commit that referenced this issue Jun 3, 2022
When Gitea is running as PID 1 git will occasionally orphan child processes leading
to (defunct) processes. This PR simply sets Setpgid to true on these child processes
meaning that these defunct processes will also be correctly reaped.

Fix #19077

Signed-off-by: Andrew Thornton <[email protected]>
zeripath added a commit to zeripath/gitea that referenced this issue Jun 3, 2022
Backport go-gitea#19865

When Gitea is running as PID 1 git will occasionally orphan child processes leading
to (defunct) processes. This PR simply sets Setpgid to true on these child processes
meaning that these defunct processes will also be correctly reaped.

Fix go-gitea#19077

Signed-off-by: Andrew Thornton <[email protected]>
AbdulrhmnGhanem pushed a commit to kitspace/gitea that referenced this issue Aug 24, 2022
When Gitea is running as PID 1 git will occasionally orphan child processes leading
to (defunct) processes. This PR simply sets Setpgid to true on these child processes
meaning that these defunct processes will also be correctly reaped.

Fix go-gitea#19077

Signed-off-by: Andrew Thornton <[email protected]>
@mpeter50
Contributor

mpeter50 commented Oct 26, 2022

I think this bug (or a similar one) is still present.

I recently restarted Gitea; it's been running for 53 minutes as of writing this, and I already have 20 zombie git processes.
All of them are children of Gitea, according to ps auxf:

[...]
root     15126  0.0  0.4 803416  7864 ?        Sl   21:21   0:00 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ef721af0eca2b747b3d029e52ff4dc57d005726fd821a20b5e7904c6b014a498 -address /run/containerd/containerd.sock
1000     15207 11.7 12.3 890284 242916 ?       Ssl  21:21   6:25  \_ /usr/local/bin/gitea -c /etc/gitea/app.ini web
1000     16346  0.0  0.0      0     0 ?        Z    21:32   0:00      \_ [git] <defunct>
1000     16354  0.0  0.0      0     0 ?        Z    21:32   0:00      \_ [git] <defunct>
1000     16362  0.0  0.0      0     0 ?        Z    21:32   0:00      \_ [git] <defunct>
1000     16370  0.0  0.0      0     0 ?        Z    21:32   0:00      \_ [git] <defunct>
1000     17331  0.0  0.0      0     0 ?        Z    21:42   0:00      \_ [git] <defunct>
1000     17340  0.0  0.0      0     0 ?        Z    21:42   0:00      \_ [git] <defunct>
1000     17348  0.0  0.0      0     0 ?        Z    21:42   0:00      \_ [git] <defunct>
1000     17358  0.0  0.0      0     0 ?        Z    21:42   0:00      \_ [git] <defunct>
1000     18381  0.0  0.0      0     0 ?        Z    21:52   0:00      \_ [git] <defunct>
1000     18389  0.0  0.0      0     0 ?        Z    21:52   0:00      \_ [git] <defunct>
1000     18397  0.0  0.0      0     0 ?        Z    21:52   0:00      \_ [git] <defunct>
1000     18405  0.0  0.0      0     0 ?        Z    21:52   0:00      \_ [git] <defunct>
1000     19231  0.0  0.0      0     0 ?        Z    22:02   0:00      \_ [git] <defunct>
1000     19239  0.0  0.0      0     0 ?        Z    22:02   0:00      \_ [git] <defunct>
1000     19247  0.0  0.0      0     0 ?        Z    22:02   0:00      \_ [git] <defunct>
1000     19255  0.0  0.0      0     0 ?        Z    22:02   0:00      \_ [git] <defunct>
1000     20187  0.0  0.0      0     0 ?        Z    22:12   0:00      \_ [git] <defunct>
1000     20195  0.0  0.0      0     0 ?        Z    22:12   0:00      \_ [git] <defunct>
1000     20203  0.0  0.0      0     0 ?        Z    22:12   0:00      \_ [git] <defunct>
1000     20211  0.0  0.0      0     0 ?        Z    22:12   0:00      \_ [git] <defunct>

I've seen this in the past few days too while I was inspecting Gitea processes, but it has probably been happening for longer.

Should I open a new issue for this?

@lunny
Member

lunny commented Oct 27, 2022

Defunct processes are processes that have terminated normally, but they remain visible to the Unix/Linux operating system until the parent process reads their status. Once the status of the process has been read, the operating system removes the process entries.

I think we need to check whether every child process's termination status has been read by the parent process.

@mpeter50
Contributor

Oh I think I can tie this to something.

I run several services - including Gitea - on a Raspberry Pi. I usually check its resource utilization with bpytop. Not a very resource-efficient tool, but whatever, still the best of the ones I've tried.
Quite a few months ago I noticed that it updates the "terminal screen" much slower than usual, and it also frequently reaches 50-90% CPU utilization (100% here is for one core). I had no idea what might be the reason, until now.

I accidentally hid the process list and noticed that this way bpytop can keep up precisely with the 1-second update rate.
And then it dawned on me: it is because of the defunct git processes. Bpytop processes all of them every second (every 4 seconds now, as I decreased the update rate for the process list specifically).


After Gitea has been running for 1 day and 21 hours, there are 970 defunct git processes on the system, out of 1167 total.
Just wanted to write this in case it is useful to someone.

@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023