I’ve been slowly transitioning to nginx as the web front-end in an effort to reduce Apache’s memory usage, which means moving more and more off of Apache. One piece I recently moved was trac: it now runs directly behind nginx in FastCGI mode, whereas previously it ran as CGI through Apache.
While FastCGI is faster, it has inherent issues: because the process is long-lived, any memory leak results in ever-growing memory usage, which is exactly why Apache has a setting for each child to serve a limited number of requests before exiting. trac.fcgi has no such directive, and it has the equivalent of a large memory leak: a non-expiring cache. That’s not quite as bad as a true leak, which grows indefinitely instead of reaching a limit, but if the cache grows larger than the memory available to trac, it’s just as serious. The only solution, short of fixing trac’s caching mechanisms, is to restart trac periodically. But while trac is restarting, all requests are lost, and users see bad gateway errors. Additionally, the restart needs to be done manually. Clearly not an ideal solution.
The ideal solution would be for the trac process to be restarted periodically while every request still completes successfully. This is what Apache accomplishes with its children, but trac has no such mechanism, or even support for one. So, I had to build it in myself.
The first piece is a parent process which holds the fcgi socket and restarts the child trac process when it dies. Because the socket stays open in the parent, requests that arrive during a restart wait in its queue and are served by either the old or new process. Such a parent process absolutely must have no memory leaks, so I wrote one that has only one explicit allocation, which is executed only once.
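As a rough illustration of the idea (not the actual launcher.c, which is written in C), here is a minimal Python 3 sketch. It assumes the listening socket arrives on file descriptor 0, which is how spawn-fcgi hands it over. The names `supervise` and `max_launches` are made up for this example, and `max_launches` exists only so the loop can be stopped when experimenting.

```python
import subprocess


def supervise(argv, max_launches=None):
    """Keep one worker alive, relaunching it each time it exits.

    The parent never closes fd 0, so the listening socket (and any
    connections queued on it) outlives every individual worker.
    max_launches is a hypothetical knob for testing; None means forever.
    """
    launches = 0
    while max_launches is None or launches < max_launches:
        # Popen leaves stdin alone by default, so the worker inherits fd 0.
        worker = subprocess.Popen(argv)
        launches += 1
        worker.wait()  # block until this worker exits, then loop again
    return launches
```

In real use this would be called as something like `supervise([os.environ["WORKER_PATH"]])`, with the whole thing started under spawn-fcgi.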
The second piece, and a considerably harder one, is to make the trac process exit gracefully when its memory usage gets too big. The first step was to create a subclass of WSGIServer and override its _mainloopPeriodic to run a periodic check. In this check I look at the process’s memory usage and, if it’s over 90MB, flag the process to exit. The problem is there’s no easy way to figure out the memory usage on Linux. There is a function, getrusage, which is supposed to give resource usage information such as memory, but Linux gives all zeros (unlike a proper kernel). The only way to get the number is to read it out of /proc and parse that data. Since that makes the check more expensive, I only run it on every 100th call.
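The check can be sketched roughly like this in Python 3. Reading the `VmRSS` line from `/proc/self/status` is an assumption on my part here, since other /proc files would work too. `MemoryCheck`, `tick`, and the helper names are invented for the example; `tick` is what the overridden _mainloopPeriodic would call.

```python
def rss_kb_from_status(text):
    """Pull the resident set size (in kB) out of /proc/<pid>/status text."""
    for line in text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # line looks like "VmRSS:   12345 kB"
    return None


def current_rss_kb():
    """Read this process's resident set size from /proc (Linux only)."""
    with open("/proc/self/status") as f:
        return rss_kb_from_status(f.read())


class MemoryCheck:
    """Flags the process for exit once memory passes a limit.

    Only every `every`-th call actually reads /proc, to keep it cheap.
    """

    def __init__(self, limit_kb=90 * 1024, every=100, read_rss=current_rss_kb):
        self.limit_kb = limit_kb
        self.every = every
        self.read_rss = read_rss  # injectable so it can be tried without /proc
        self.calls = 0
        self.should_exit = False

    def tick(self):
        self.calls += 1
        if self.calls % self.every == 0:
            rss = self.read_rss()
            if rss is not None and rss > self.limit_kb:
                self.should_exit = True
        return self.should_exit
```

Once `tick()` returns True the flag stays set, so the main loop can finish what it’s doing and then begin shutting down.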
After doing this, I was still getting periodic bad gateway errors. It turns out that trac spawns a thread to process each request, and that thread hadn’t completed when the process exited, dropping the connection and causing the error. According to the documentation, Python is supposed to wait for threads to complete before exiting. Since it wasn’t, I put in a mechanism to check whether any threads were running before exit. Here lies a big problem with Python. I found out that the thread, while created, hadn’t actually started. Since it hadn’t started, it wasn’t running, which is why Python exited. Furthermore, Python’s threading is so brain-dead that there seems to be no way at all to differentiate between a thread which hasn’t started and one that has exited but not been freed. This means there is no reliable way to detect whether all threads have exited. So instead, to work around Python, I created a thread-safe counter to count the number of threads. I increment it when a thread is created (not started), and decrement it when the thread completes. I then only allow the main thread to exit when this counter reaches zero (since the main thread creates the threads, this never lets the process die without starting all of them). Given this glaringly bad threading model, I put in another protection mechanism so that the main thread will exit after 30 seconds even if the count isn’t zero, just in case.
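A counter along these lines might look like the following. This is a Python 3 sketch rather than the code in trac.fcgi, and `ThreadCounter` and `run_counted` are names invented for the example.

```python
import threading


class ThreadCounter:
    """Thread-safe count of request threads that are created but not finished."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def increment(self):
        with self._cond:
            self._count += 1

    def decrement(self):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()  # wake anyone waiting to exit

    def wait_for_zero(self, timeout=30.0):
        """Return True once the count hits zero, False if the timeout expires."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout)


def run_counted(counter, target, *args):
    """Start target in a thread, counting it from creation to completion."""
    counter.increment()  # counted before start(), so an unstarted thread still counts

    def wrapper():
        try:
            target(*args)
        finally:
            counter.decrement()  # runs even if the request handler raises

    thread = threading.Thread(target=wrapper)
    thread.start()
    return thread
```

The main thread would call `wait_for_zero()` just before exiting. The 30-second default timeout is the same safety net described above, so a stuck request can’t keep the old process around forever.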
With the above two pieces, trac’s memory usage is limited and no connections are dropped. However, in the time between one process deciding to go down and when it actually exits, nothing is processing requests. So, the last piece is to make trac signal to the parent process that it has decided to exit, and have the parent launch a new trac to take over while the previous one is exiting. I did this with the USR1 signal: the child sends USR1 to the parent, and the parent responds by sending a HUP to the child (in case someone else sent the USR1 signal) and starting a new child. With these modifications in place, trac has been humming along for nearly a month, being restarted about 2-3 times a day with no issues.
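The handshake can be sketched in Python 3 as below. This is only an illustration of the signalling, not the real launcher.c. `Parent`, `spawn`, and `announce_exit` are hypothetical names, and reaping exited children is left out to keep it short.

```python
import os
import signal


class Parent:
    """Tracks the current worker and replaces it when asked via SIGUSR1."""

    def __init__(self, spawn, kill=os.kill):
        self.spawn = spawn  # hypothetical: callable that starts a worker and returns its pid
        self.kill = kill
        self.child_pid = spawn()

    def on_usr1(self, signum, frame):
        old_pid = self.child_pid
        # Tell the old worker to wind down, even if the USR1 came from elsewhere.
        self.kill(old_pid, signal.SIGHUP)
        # Start the replacement right away so requests keep being served.
        self.child_pid = self.spawn()


def announce_exit():
    """Called in the worker once it has decided to exit."""
    os.kill(os.getppid(), signal.SIGUSR1)
```

The parent would register the handler with `signal.signal(signal.SIGUSR1, parent.on_usr1)`, and the worker would need its own HUP handler that sets the same exit flag as the memory check.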
launcher.c – Requires the environment variable WORKER_PATH to be set so it knows which process to launch. Best run with something like spawn-fcgi.
trac.fcgi – Modified trac.fcgi incorporating the above-mentioned changes, complete with commented code for testing/experimentation.