Monday, June 23, 2008

Loose Threads

Realeyes is a somewhat complex application, both in terms of the number of components that interact with each other (4), and the complexity of those components, in particular the analysis engine/IDS, but also the database and database interface. The analysis engine and IDS are written in C, while the database interface and user interface are written in Java.

When I was planning the design of the analysis engine, I knew from the start that there would be multiple processes running. That left me with the decision of whether to use threads or interprocess communication. I know from painful experience that writing thread-safe code is hard (the TCP/IP stack written in System/390 assembler that I helped maintain). Therefore I chose to use interprocess communication. I actually had several reasons for choosing this over threads:

  • Writing thread-safe code is really hard.

  • Threads share the same address space and, while all analysis engine processes share some memory, some of them also use significant amounts of unshared memory. I was concerned that this might lead to the application running out of virtual memory.

  • For security reasons, the external interface runs under an ID that has lowered access and in a chroot jail. This means that interprocess communication would have to be used for at least this function.

  • The pcap library for capturing network traffic from the interface was going to be used, and I was pretty sure it could not be used in a threaded process.

  • I wanted to be able to control the priority of processes dynamically, and while the pthread_setschedparam man page says, "See sched_setpolicy(2) for more information on scheduling policies,", there is no man page for sched_setpolicy (I have searched the web for it).

  • Writing thread-safe code is really hard.

Long after going through this thought process, I discovered this paper by Dr. Edward A. Lee at UC Berkeley that supports my reasoning. After performing a formal analysis to prove that writing thread-safe code is really hard, Dr. Lee recommends that code be written single-threaded and then elements of threading (or interprocess communication) be added only as needed. Thank you, Dr. Lee.

This left me with the decision of which IPC techniques to use. There are essentially three:

  • Pipes

  • Message queues

  • Shared memory

I read an article about a test that compared the three (which I cannot find now) and shared memory won hands down (an order of magnitude faster, as I recall). Therefore, while pipes are used in the analysis engine to transfer small pieces of data or low priority information, shared memory is the primary mechanism.

Of course, shared memory is the most difficult to program because it requires a way of guaranteeing that the data stored in every memory location is correct at all times. This is handled in the analysis engine by all of the following methods:

  • Assigning memory locations to a single process that others cannot access

  • Using locks (or semaphores in glibc-speak) to serialize access, which means the operating system allows only the process holding the lock to access the locked memory location

  • Using a mechanism similar to locks (but without the overhead) to serialize access

The center piece of this is the memory manager. When the application starts, a single large block is allocated and made non-swappable. This means that the application never has to wait for a block to be swapped in from disk, which is not done for memory allocated by a process for its own use. This block is chopped up into pools which are in turn chopped up into buffers. (Note: This is an oversimplification, see the analysis engine slide show on the Realeyes technology page for more detail.)

The memory manager sets an "in use" flag to indicate that a buffer is being used, and clears it when the buffer is released. Each level of the analysis engine uses specific structures, and when the "in use" flag is set for a buffer, other processes are not allowed to access it unless the structure is explicitly passed to them. This is the way the first access method is implemented.

The second access method is actually used by the memory manager to obtain or release a buffer. But it is also used by processes to modify structures in memory that could be potentially modified by two processes simultaneously. Most books on programming with semaphores usually start by saying that POSIX semaphores are overly complicated. I don't disagree, but after a little experimentation, I simply wrote a set of functions to initialize, get, free, and release a single lock. As it turned out, my first attempt did not work well across all platforms where the application was tested. But the correction was basically confined to the functions, with only the addition of an index parameter to one of them that meant changing about a dozen calls in the analysis engine code.

The third access method is very much like message queues, but with the performance of shared memory. When a process has information (in a structure) to pass to another, it puts the structure on a queue that only it may add to and only one other process may remove from. The rule governing most of these queues is that the first structure in the queue may only be removed if there is another one following it. In programming terms, there must be a non-NULL next pointer. So the first process modifies the structure to be added, and the very last step is to set the pointer of the last item in the queue to the new structure's address.

Special handling is necessary for some queues. For example, if there is very little activity, a single structure could be on a queue by itself for a long time (in computer cycles). This is handled in some queues by adding a dummy structure after the one to be processed after a brief wait (maybe a hundredth of a second).

A side effect of the choice of processes over threads is that it is much easier to monitor a process than a thread. It is also quite a bit more straightforward to use a debugger on a process. So, all things considered, I recommend this over threads unless there are strong reasons against it.

Finally, I have to say that the Java code does use threads. However, they are treated like separate processes in that they don't share memory. All data is passed in arguments to the methods being called or in the method return value. This eliminates the most problematic aspects of making code thread-safe, but (I have discovered) not all of them. The other issues are memory-related, but it is memory that the application does not control, such as the window in which the application is displayed, or network connections.

Overall, I agree with Dr. Lee when he says that threads "are wildly nondeterministic. The job of the programmer is to prune away that nondeterminism." And I don't find it to be too much of a stretch when he continues that, "a folk definition of insanity is to do the same thing over and over again and to expect the results to be different. By this definition, we in fact require that programmers of multithreaded systems be insane. Were they sane, they could not understand their programs."

Later . . . Jim


Jose said...

I'm writing this reply after receiving your email this morning (at the domain).

I read the pdf on nondeterminacy problems. The argument isn't one of processes over threads. You can probably structure threads to behave essentially like processes (eg, the Java examples).

Would threads improve performance? You suggested to me that would not be the case because the kernel would handle threads and processes very similarly. I don't know, but I would expect that the threading support (or some amount of it) could be done entirely in user space saving some amount of context switching. [Hmmm, the kernel could be used to keep threads blocking on I/O from blocking the whole process, but kernel support could be avoided altogether perhaps for threads that could achieve their jobs without relying on such thread blocking calls or similarly tricky scenarios.] Of course, we also would be putting the capabilities (you mentioned priority setting), performance, and correctness of the kernel and its community against the thread libraries and its smaller community.

Interesting read, btw (here.. since I am partly familiar with realeyes now.. and in the paper linked).

I'll finish my reply directly to the email but thought this comment made sense on this page.

Jose said...

To be more specific to this article, I would say that threading might help performance if added to some parts, but avoiding using multiple processes is likely not sensible.

Maybe put some of the plugins under the same process(es). [Haven't read enough code to know if that makes sense, but it might if they switch a lot from one to the other and if threading would improve performance (and if such improvements would be noticeable). If after reading more code I still think this way, I might try to patch the code to test it.]