Monday, June 23, 2008

Loose Threads

Realeyes is a somewhat complex application, both in terms of the number of components that interact with each other (4), and the complexity of those components, in particular the analysis engine/IDS, but also the database and database interface. The analysis engine and IDS are written in C, while the database interface and user interface are written in Java.

When I was planning the design of the analysis engine, I knew from the start that there would be multiple processes running. That left me with the decision of whether to use threads or interprocess communication. I know from painful experience that writing thread-safe code is hard (the TCP/IP stack written in System/390 assembler that I helped maintain). Therefore I chose to use interprocess communication. I actually had several reasons for choosing this over threads:

  • Writing thread-safe code is really hard.

  • Threads share the same address space and, while all analysis engine processes share some memory, some of them also use significant amounts of unshared memory. I was concerned that this might lead to the application running out of virtual memory.

  • For security reasons, the external interface runs under an ID that has lowered access and in a chroot jail. This means that interprocess communication would have to be used for at least this function.

  • The pcap library for capturing network traffic from the interface was going to be used, and I was pretty sure it could not be used in a threaded process.

  • I wanted to be able to control the priority of processes dynamically, and while the pthread_setschedparam man page says, "See sched_setpolicy(2) for more information on scheduling policies,", there is no man page for sched_setpolicy (I have searched the web for it).

  • Writing thread-safe code is really hard.

Long after going through this thought process, I discovered this paper by Dr. Edward A. Lee at UC Berkeley that supports my reasoning. After performing a formal analysis to prove that writing thread-safe code is really hard, Dr. Lee recommends that code be written single-threaded and then elements of threading (or interprocess communication) be added only as needed. Thank you, Dr. Lee.

This left me with the decision of which IPC techniques to use. There are essentially three:

  • Pipes

  • Message queues

  • Shared memory

I read an article about a test that compared the three (which I cannot find now) and shared memory won hands down (an order of magnitude faster, as I recall). Therefore, while pipes are used in the analysis engine to transfer small pieces of data or low priority information, shared memory is the primary mechanism.

Of course, shared memory is the most difficult to program because it requires a way of guaranteeing that the data stored in every memory location is correct at all times. This is handled in the analysis engine by all of the following methods:

  • Assigning memory locations to a single process that others cannot access

  • Using locks (or semaphores in glibc-speak) to serialize access, which means the operating system allows only the process holding the lock to access the locked memory location

  • Using a mechanism similar to locks (but without the overhead) to serialize access

The center piece of this is the memory manager. When the application starts, a single large block is allocated and made non-swappable. This means that the application never has to wait for a block to be swapped in from disk, which is not done for memory allocated by a process for its own use. This block is chopped up into pools which are in turn chopped up into buffers. (Note: This is an oversimplification, see the analysis engine slide show on the Realeyes technology page for more detail.)

The memory manager sets an "in use" flag to indicate that a buffer is being used, and clears it when the buffer is released. Each level of the analysis engine uses specific structures, and when the "in use" flag is set for a buffer, other processes are not allowed to access it unless the structure is explicitly passed to them. This is the way the first access method is implemented.

The second access method is actually used by the memory manager to obtain or release a buffer. But it is also used by processes to modify structures in memory that could be potentially modified by two processes simultaneously. Most books on programming with semaphores usually start by saying that POSIX semaphores are overly complicated. I don't disagree, but after a little experimentation, I simply wrote a set of functions to initialize, get, free, and release a single lock. As it turned out, my first attempt did not work well across all platforms where the application was tested. But the correction was basically confined to the functions, with only the addition of an index parameter to one of them that meant changing about a dozen calls in the analysis engine code.

The third access method is very much like message queues, but with the performance of shared memory. When a process has information (in a structure) to pass to another, it puts the structure on a queue that only it may add to and only one other process may remove from. The rule governing most of these queues is that the first structure in the queue may only be removed if there is another one following it. In programming terms, there must be a non-NULL next pointer. So the first process modifies the structure to be added, and the very last step is to set the pointer of the last item in the queue to the new structure's address.

Special handling is necessary for some queues. For example, if there is very little activity, a single structure could be on a queue by itself for a long time (in computer cycles). This is handled in some queues by adding a dummy structure after the one to be processed after a brief wait (maybe a hundredth of a second).

A side effect of the choice of processes over threads is that it is much easier to monitor a process than a thread. It is also quite a bit more straightforward to use a debugger on a process. So, all things considered, I recommend this over threads unless there are strong reasons against it.

Finally, I have to say that the Java code does use threads. However, they are treated like separate processes in that they don't share memory. All data is passed in arguments to the methods being called or in the method return value. This eliminates the most problematic aspects of making code thread-safe, but (I have discovered) not all of them. The other issues are memory-related, but it is memory that the application does not control, such as the window in which the application is displayed, or network connections.

Overall, I agree with Dr. Lee when he says that threads "are wildly nondeterministic. The job of the programmer is to prune away that nondeterminism." And I don't find it to be too much of a stretch when he continues that, "a folk definition of insanity is to do the same thing over and over again and to expect the results to be different. By this definition, we in fact require that programmers of multithreaded systems be insane. Were they sane, they could not understand their programs."

Later . . . Jim

Wednesday, June 11, 2008

Elitism Improves Productivity

The Realeyes IDS application includes multiple plugins that interact with each other. The basic means of communication is a structure with information about the status of a network session, put on a queue by one plugin and taken off by the next one to process the session.

At the lowest level, this is a Data structure, which defines the packet captured by the Collector. The Data structure is then taken by the Stream Handler which determines which session it belongs to and sets some information, such as the start time, and then puts a Stream Analysis Work Element (SAWE) on another queue. The Stream Analyzers perform matching operations on the packets based on the rules defined for each one. Then the Action Analyzer and Event Analyzer perform correlation on the results of the Stream Analyzers.

This works very smoothly, except for the fact that there are multiple Stream Analyzers and one Action Analyzer. The Action Analyzer can free Data structures, and it must not free any that are still being processed. Because all of this analysis is happening asynchronously, the fields that indicate the state can change while being tested.

To handle this, I created a separate field that is set once when the session is ready for the Action Analyzer. Initially, I tried to wait briefly for the Stream Analyzers to update these fields. Of course, briefly is in the eye of the beholder. I set the wait value to 1 microsecond, which is 0.000001 second.

But the standard clock in most Intel computers is actually ticking once per 0.1 millisecond, or 0.0001 second. This is like saying, "Give me a second," and then taking over a minute and a half. The result was that work piled up waiting on the Action Analyzer. Buffers could not be freed and the application could not run for more than a couple of hours in the pilot environment.

I finally realized that instead of waiting for the first SAWE on the queue, the Action Analyzer should try to find one that was ready. In other words, it should ignore the structures that didn't meet its standards, and only choose that of the highest quality. In still other words, it should be an elitist.

And low and behold, buffer usage became almost a non-issue. The application now runs for days without running out of buffers. (In fact, it usually crashes from a bug before it runs out of buffers, but I'm working on fixing those.)

This demonstrates that being described as an elitist can be a compliment.

Later . . . Jim