
Monday, November 8, 2010

FOSS GUI IDEs

A question was recently posed on my LUG's mailing list asking what advantages there are to using a GUI IDE over a terminal based editor, such as vim or emacs. Here is my response:

Probably the main difference between terminal editors and GUIs is that the terminal editors require you to use a key sequence to perform tasks while the GUIs automagically display information in popups. If you have spent enough time to learn the key sequences, then (IMO) not having to move your hand between the keyboard and mouse gives terminal editors the advantage. Also, different GUIs provide different capabilities, and none of the ones I have used (see below) are a clear winner over the others.

These features include:

  • Line collapsing
  • Auto-completion of function and variable names
  • Popups for functions showing the arguments, including data type
  • Popups showing the values of macro definitions
  • Popups showing the documentation for functions/classes (a man page in a popup)
Theoretically, the GUIs also offer better debugging capabilities. However, I am more comfortable with command line gdb. To me the GUIs are way too busy and I find the command line more flexible (p *mystruct or x/4x 0x80000c00 as opposed to a 4 click minimum and at least 10 to find the menu for doing anything other than displaying local variables).

However, there are some things that (AFAIK) GUIs can do that terminal editors can't. For example, when a compile has errors, being able to click on a compile error message and go straight to the flagged line in the source. (If anyone tells me that this is a simple matter of using ctags, then I demand you provide a simple explanation of how to use ctags to do this and all the other wonderful things that ctags claim to provide! Otherwise, I will continue to believe that ctags are a waste of time and resources.)

I have used Anjuta (which is written in C/C++ and uses GTK), Eclipse, and lately the Nokia SDK for Qt development. Here are my impressions:

Eclipse: Don't add C plugins to Eclipse, get the Ganymede Eclipse IDE for C/C++. I tried going the first path and it was ugly. Beyond that, yes Eclipse is best at Java, but this version is pretty good at C/C++. When the cursor is on a variable or function/class, all other instances of it within the current scope are highlighted. If you know your way around Eclipse, configuration is easier than in the others.

Anjuta: What I love about Anjuta is that it built my autoconf and automake files for me. I was then able to tweak those to handle my custom requirements. If you have ever battled ac/am you know that every bit of help is welcome. Unfortunately, when I upgraded to a newer version, a lot of things changed. My older project files didn't quite work, so I spent a lot of time fixing those. Editor functions were different. The compile errors wouldn't go to the source code.

Nokia Qt SDK: This includes a designer for GUI (Qt specific) interfaces. Having used (and been spoiled by) the Google Android plugin for Eclipse and its GUI designer, I am not all that impressed, but it's better than not seeing the layout until you get your source compiled and debugged. Using the IDE for editing C/C++ is actually quite pleasant, and I am considering using it for future projects.

I have looked at KDevelop, but don't have any projects using it. Maybe the version I have is old, but so far it feels clunkier than the others.

Later . . . Jim


Thursday, July 16, 2009

Managing Realeyes Memory

I have been at work on a new project that I hope to announce soon. But at the moment I need a break to let the algorithm for a particularly tricky function germinate. So I am going to describe how I do memory management in the Realeyes IDS.

First, I have to say that this is not a generic memory manager. It is specific to my application. It may be possible to adapt it to other applications, but the key word here is 'adapt'. However, it will hopefully give anyone who is considering doing their own memory management some food for thought.

The reason I do my own memory management is to avoid fragmentation. The Realeyes IDS manages a lot of sessions simultaneously, so memory has to be used as efficiently as possible. If a buffer were allocated for exactly the size of a packet's data, the overall buffer space would develop lots of pockets of unusable space. But to set the size of data buffers to the largest allowed by the Internet Protocol would also be inefficient, because there are a huge number of tinygrams in network traffic.

The solution is a compromise. I allocate fixed size buffers in sizes that are designed to waste as little space as possible. The smallest is 64 bytes. The next is 105 bytes, and the next, 128 bytes. So if 56 bytes are needed, the first size buffer is allocated. If 65 bytes are needed, the second. And if 108 bytes are needed, the third.

If you did a double take on the 105 byte buffer size, there is a method to my madness. These buffers are kept in pools of 8 Kilobytes. A 64 byte buffer pool will hold 128 buffers, and a 128 byte buffer pool will hold 64 buffers. Both of these will fit exactly into the 8K pool with no wasted space. To fine tune this a bit, I found the buffer size between them that wastes the least space. Allocating 78 buffers of 105 bytes each uses 8190 bytes, which wastes only 2 bytes of space in an 8K pool.

Here is the complete list of buffer sizes: 64, 105, 128, 195, 256, 390, 512, 744, 1024, 2048, 4096, 8192. If larger sizes are needed, multiple adjacent pools are allocated, up to 64K. Again, this is specific to the Realeyes IDS application, which can guarantee that no buffer larger than 64K will ever be requested.
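The selection rule above amounts to a table scan over the fixed sizes. This is not the actual Realeyes code, just a minimal sketch of picking the smallest size class that fits a request (the function name is invented for illustration):

```c
#include <stddef.h>

/* The fixed buffer sizes, chosen to waste as little of an 8K pool as possible */
static const int pool_sizes[] =
    {64, 105, 128, 195, 256, 390, 512, 744, 1024, 2048, 4096, 8192};

/* Return the smallest fixed buffer size that can hold 'needed' bytes,
   or -1 if the request is larger than a full pool */
int pick_buffer_size(int needed)
{
    size_t i;
    for (i = 0; i < sizeof(pool_sizes) / sizeof(pool_sizes[0]); i++)
        if (needed <= pool_sizes[i])
            return pool_sizes[i];
    return -1;  /* larger requests are handled by multiple adjacent pools */
}
```

So a 56 byte request gets a 64 byte buffer, a 65 byte request gets 105, and a 108 byte request gets 128, exactly as in the examples above.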

When the application is initialized, a huge buffer (many megabytes) is allocated and it is divided into 8K pools. Then, when a buffer is requested, if there is no pool for the appropriate buffer size already selected, the next available pool is assigned to provide buffers of that size only, and the first buffer in the pool is returned to the requester. If a pool already exists for the buffer size and has free buffers, a buffer from that pool is returned.

To handle the requests, each allocated pool is kept on one of three queues for that buffer size. The reason there are three is that the entire queue must be locked while buffers are being allocated or freed. The Realeyes IDS functions to handle semaphores allow for a caller to request an immediate return if a lock is already held. This means that if one of the pool queues is in use, the caller can try the next one. The rae_mem_mgmt.c module keeps track of the last queue accessed and uses a simple round robin method to check the next queue.
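The round-robin selection can be illustrated with a small pure function. The real module tracks the last queue accessed per buffer size; this hypothetical helper just picks the next queue of the three whose lock is not held, returning -1 if all are busy:

```c
#define NUM_QUEUES 3

/* 'last' is the index of the queue used on the previous call;
   'busy[i]' is nonzero if queue i's lock is currently held.
   Returns the index of the next free queue, or -1 if all are busy. */
int next_free_queue(int last, const int busy[NUM_QUEUES])
{
    int i, q;
    for (i = 1; i <= NUM_QUEUES; i++)
    {
        q = (last + i) % NUM_QUEUES;  /* try the queues in round-robin order */
        if (!busy[q])
            return q;
    }
    return -1;
}
```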

So far, so good. But there are still some loose ends. What if all of the buffers in a pool are in use, and that pool is at the head of the pool queue? For that matter, what if the first 1,000 pools in a queue have no available buffers? This is where the manager comes in.

For each pool size there are Full and Free queues. The manager periodically (about 500 times a second) checks each of the available queues, removes all pools that have no available buffers, and puts them on the Full queue for that buffer size. Pools on the Full queue that have available buffers are put on the Free queue. And pools that are on the Free queue are divvied up so that each available queue has approximately the same number of pools.

There are a few other steps in managing the queues, which is done in rae_mem_ctl.c. If an allocated pool has all of its buffers freed, it is put on a general available queue to be reused for possibly a different buffer size. Also, there is a queue at each buffer size for full pools that have not had available buffers for a period of time. This is only checked once a second to see if buffers have been freed.

So does it work? In the pilot project I have been running for over a year, the statistics I have collected show that the average number of sessions being managed simultaneously is around 20,000. Assuming an average size of 16K per session, that is 325M of data, plus the application data about each session. And that is just a snapshot. There are many Gigabytes of data being examined over the course of an hour. When the IDS does run out of buffers (I'm working on it, OK?!?), it is because the application hasn't released them when it should, not because the memory management is bogging down.

So that's the essence of how memory management is handled in the Realeyes IDS. However, because the application uses multiple processes instead of threads, the memory must be shared. I will cover that in a future post.

Later . . . Jim

Monday, June 1, 2009

Handling Semaphores in C Programs

A while back, Carla Schroder, over at LinuxToday.com, put out a request for articles on programming. Now that I have put the downloads for Realeyes IDS version 0.9.5 up on SourceForge, I get to have some fun answering her call.

What I have found in programming books, including those by the late W. Richard Stevens (which I turn to most often) is usually a good start, but never the whole story. But since this is not a general programming text, I will focus on a single issue in detail. This post will cover using semaphores.

First, a little background on locks, in case you have never used them. In *nix, process locks are implemented with the semaphore system calls. Since I use child processes that share memory, I have to implement semaphores. Threads use pthread_mutex calls, which do essentially what these functions do, and then some.

The most common reason for implementing locks is if you have multiple concurrently running processes or threads that have access to the same variable in memory. Obtaining or releasing a lock is guaranteed by the operating system to be completed without interruption. A single line of C code, such as
 if (flag & 4)
requires a minimum of two machine instructions:
  • Get the value of flag from memory into a register

  • Compare the value to zero
It is possible the thread running that code could be swapped out after getting the value, but before comparing it, and the value of flag could be changed by another thread, making the comparison invalid. By requiring every thread to get the flag lock before reading or writing the value of flag, only one thread accesses flag at a time.
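In the threaded case mentioned above, the same protection looks like this with pthread_mutex calls (a minimal sketch; flag and the bit values are just placeholders):

```c
#include <pthread.h>

int flag = 0;
pthread_mutex_t flag_lock = PTHREAD_MUTEX_INITIALIZER;

/* Read a bit of 'flag' without racing against writers */
int flag_bit_set(int bit)
{
    int set;
    pthread_mutex_lock(&flag_lock);   /* no other thread can touch flag here */
    set = (flag & bit) != 0;
    pthread_mutex_unlock(&flag_lock); /* hold the lock only for the test */
    return set;
}

/* Writers take the same lock before changing flag */
void flag_bit_on(int bit)
{
    pthread_mutex_lock(&flag_lock);
    flag |= bit;
    pthread_mutex_unlock(&flag_lock);
}
```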

The rule of thumb for the code while a lock is held is to do only what requires holding the lock, and no more. Often there are less than ten instructions between getting and releasing a lock. However, sometimes there are a couple dozen, because all of them require holding the lock. My memory management is an example of this, and I will try to cover it down the road.

OK, now for how I implement semaphores. For some reason, the caller is required to define the semun structure. This definition is taken from the semctl man page and is in the rae_lock_mgmt.h file.

union semun {
    int val;                    /* Value for SETVAL */
    struct semid_ds *buf;       /* Buffer for IPC_STAT, IPC_SET */
    unsigned short int *array;  /* Array for GETALL, SETALL */
    struct seminfo *__buf;      /* Buffer for IPC_INFO */
};
All of the following code is from the rae_lock_mgmt.c file. First, I define a global to keep track of a held lock. This is done so that if there is an interrupt, such as a segmentation fault, while the lock is held, it can be released in the signal handler by calling rae_release_lock. The caller must set these fields to the address of the variables used to track the semaphore ID and index.
/* Pointer to currently held lock */
int *rae_held_lock = NULL;
int *rae_hl_index = NULL;
Before a lock can be used, it must be initialized. I use a lot of locks and have found that different Linux distributions have different defaults for the maximum number of semaphores an application may allocate. To keep the number of allocations down, I have grouped the locks by functionality, and each group gets a semaphore set (or array) which only uses a single semaphore ID. Therefore, the number of locks in the group is passed to the init function.
int rae_init_lock(int il_size)
{
    int i, il_semid = -1;
    union semun sem_union;
The semget call returns a new semaphore set. Then each lock in the array is initialized to a value of 1 to indicate that it is available. If that fails, the semaphore ID is set to a negative value, which is -1 if the semaphore set is released, and -2 if it is not.
    if ((il_semid = semget(IPC_PRIVATE, il_size, (0660|IPC_CREAT))) > 0)
    {
        sem_union.val = 1;
        for (i=0; i < il_size; i++)
        {
            if (semctl(il_semid, i, SETVAL, sem_union) == -1)
            {
                if (semctl(il_semid, 0, IPC_RMID, sem_union) == -1)
                    il_semid = -2;
                else
                    il_semid = -1;
                break;  /* the set is gone, so stop initializing */
            }
        }
    }
    return(il_semid);
} /* end rae_init_lock */
When the application shuts down, the locks must be freed. If not, they remain allocated by the system. You can run 'ipcs -s' to see the locks that are held. If an application fails to release a lock, you can run ipcrm (read the man page) as root to release it.

Notice that the location of the semaphore ID in the application is passed to this function, and it is set to zero by the function. This is because the semctl command, IPC_RMID, ignores the index and simply removes the entire semaphore set. Also, I prefer to do as much as possible in a function so the caller does not have to worry about the details. This way, when I call the same function from different places, I reduce the risk of forgetting to set something.

int rae_free_lock(int *fl_semid, int fl_idx)
{
    int fl_stat = 0;
    union semun sem_union;

    if (*fl_semid == 0)
        goto out;
    sem_union.val = 1;
    fl_stat = semctl(*fl_semid, fl_idx, IPC_RMID, sem_union);
    *fl_semid = 0;
out:
    return(fl_stat);
} /* end rae_free_lock */
When a lock is needed for a memory location, the get lock function is called with the lock identifier, which consists of the semaphore ID and the index in its array. I have added a wait flag to allow some locks to be conditional. In my memory management code, I have three available buffer queues, and if the lock on one is held, the caller can simply try the next one without waiting.

The semop call gets and releases a lock by subtracting the supplied value from the semaphore's value. What this means is that you, the programmer, are responsible for defining what value indicates a held or released lock. By keeping all of this logic in a pair of functions, you have control over how it is implemented. All of the examples I have seen use 1 to indicate the lock is available, and 0 to indicate that it is held. I can imagine how other values might be used, but it seems ridiculously complicated and prone to error.
int rae_get_lock(int gl_wait, int gl_semid, int gl_idx)
{
    int gl_stat = 0;
    struct sembuf sem_b;

    if (gl_semid == 0)
    {
        gl_stat = -2;
        goto out;
    }
The semaphore buffer structure is defined in the system headers. This is where the semaphore array index and operation are set. And as I said before, this function allows the caller to wait or not, which is accomplished by using the semaphore flag bit, IPC_NOWAIT.

The SEM_UNDO flag bit is supposed to reverse the operation when the process terminates, which implies that if the process fails while holding a lock, the system will release it (but not free it). However, in my experience, that doesn't always work, so I have included the capability to do this in my interrupt handlers, as I mentioned above.

    sem_b.sem_num = (short int) gl_idx;
    sem_b.sem_op = -1;
    if (gl_wait & raeLOCK_WAIT)
        sem_b.sem_flg = SEM_UNDO;
    else if (gl_wait & raeLOCK_NWAIT)
        sem_b.sem_flg = SEM_UNDO | IPC_NOWAIT;
    else
    {
        gl_stat = -1;
        goto out;
    }
This is the heart of the function. Read the semop man page for a lot more detail, but the general idea is as follows. The semop system call will attempt to subtract 1 from the current value of the lock. If that value is 1, the operation occurs immediately. Otherwise, the call will wait or return with the errno value of EAGAIN if the lock is held. Of course, there is the possibility the call will fail entirely, which must be handled.

If the lock value is set to 0, this means the lock is obtained, and this function sets the semaphore ID and index in the global lock tracking variables.

    if ((gl_stat = semop(gl_semid, &sem_b, 1)) == -1)
    {
        if ((gl_wait & raeLOCK_NWAIT) && errno == EAGAIN)
            gl_stat = 1;
        else if (errno == EIDRM)
            gl_stat = -2;
    }
    if (!gl_stat && rae_held_lock != NULL)
    {
        *rae_held_lock = gl_semid;
        *rae_hl_index = gl_idx;
    }
out:
    return(gl_stat);
} /* end rae_get_lock */
This is the reverse of the get lock function, in that it adds one to the lock value. There is no wait flag for releasing a lock, so only the semaphore ID and index are supplied. If the lock value is set to 1, the lock is released and this function clears the semaphore ID and index in the global lock tracking variables.
int rae_release_lock(int rl_semid, int rl_idx)
{
    int rl_stat = 0;
    struct sembuf sem_b;

    sem_b.sem_num = (short int) rl_idx;
    sem_b.sem_op = 1;
    sem_b.sem_flg = SEM_UNDO;
    rl_stat = semop(rl_semid, &sem_b, 1);
    if (!rl_stat && rae_held_lock != NULL)
    {
        *rae_held_lock = 0;
        *rae_hl_index = 0;
    }
    return(rl_stat);
} /* end rae_release_lock */
This set of functions makes using semaphores as easy as:
  • Init lock

  • Get lock

  • Release lock

  • Free lock
Of course, the caller code must be well thought out to prevent a deadly embrace. That is accomplished by keeping the code using the Get and Release calls as simple as possible, and making sure the instructions between them absolutely require the lock.
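Boiled down to raw system calls, the whole lifecycle looks like this. This is a condensed sketch of what the four functions above do for a single lock, without the index bookkeeping, wait options, or global tracking:

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun {
    int val;
    struct semid_ds *buf;
    unsigned short int *array;
    struct seminfo *__buf;
};

/* Init, get, release, and free a single lock; returns 0 on success */
int lock_lifecycle_demo(void)
{
    int semid;
    union semun su;
    struct sembuf sb = {0, 0, SEM_UNDO};

    /* Init: one semaphore, value 1 (available) */
    if ((semid = semget(IPC_PRIVATE, 1, 0660 | IPC_CREAT)) < 0)
        return -1;
    su.val = 1;
    if (semctl(semid, 0, SETVAL, su) == -1)
        return -1;

    /* Get: subtract 1, which would wait if the value were already 0 */
    sb.sem_op = -1;
    if (semop(semid, &sb, 1) == -1)
        return -1;

    /* Release: add 1 back, making the lock available again */
    sb.sem_op = 1;
    if (semop(semid, &sb, 1) == -1)
        return -1;

    /* Free: remove the semaphore set from the system */
    return semctl(semid, 0, IPC_RMID, su);
}
```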

Later . . . Jim

Wednesday, April 29, 2009

Command Line vs. GUI Reality Check

I was reading Kyle Rankin and Bill Childers' Point/Counterpoint column on "Mutt vs. Thunderbird" in Linux Journal over breakfast the other day. It mostly boiled down to the perennial text vs. graphical user interface argument. And since I don't have a strong opinion about either mail client, it got me to thinking about the real difference between the two interfaces.

Before I dive into the firepit, let me explain that I spent the first half of my career in the IBM mainframe world, which meant writing Job Control Language (JCL) to submit batch jobs. Part of that time, I spent maintaining a TCP/IP stack written in assembler language. Compared to that, the distance between the command line and the GUI is much smaller than their advocates seem to realize.

To be honest, I didn't use GUIs for quite a while after they became available. I didn't fight them so much as I just didn't find them intuitive or efficient. However, as usability studies improved the interfaces and processor speeds increased their capabilities, I gradually came to appreciate what they provide.

I think that much of the bias in favor of GUIs comes from the saying, "A picture is worth a thousand words." And certainly, pictures of cityscapes or family reunions or battlefields convey much more information than even the best description. However, imagine having to draw a picture to say, "I'm thirsty," or, "Turn left at the next light," or even to describe how to use a graphical user interface to a first time user. Very often, a few words are more efficient than any picture.

Obviously, each interface has its strengths and weaknesses. So what I was wondering, as I read Bill and Kyle, was whether it is possible to generalize the strengths of each. I'm not suggesting that I have it completely figured out, but I think there are, in fact, some common elements for the situations in which each is superior to the other.

To begin with, I am assuming that the interface in question has gone through enough testing and real-world use to be very good at what it is supposed to do. The Linux/Unix command line, with its many utilities that can be piped or scripted into miniature applications, is the best example of that interface. And I prefer the GNU collection because of the extra arguments that have been added by developers scratching an itch. Meanwhile, the current state of the art web browsers are good examples of GUIs.

Command line interfaces are best for data that is concise and reasonably well known in advance. For example, comparing the dates of two copies of the same file in different directories:
    ls -l /a/b/c/file /d/e/f/file
Or finding information about running processes:
    ps aux | grep program
Or starting and stopping daemons:
    /etc/init.d/daemon start
    /etc/init.d/daemon stop
The last two examples are mostly used by administrators, and I think that the command line is especially appropriate for that type of work. However, as a developer, I find that the most repetitive tasks I perform are done so most efficiently on the command line. And I personally almost never use a file browser, preferring the flexibility of the command line.

Graphical user interfaces are best for displaying somewhat large amounts of data that needs to be analyzed to determine what to do next. When a web page is loaded in a browser, it is usually not known in advance what link(s) will be selected. A window with a helpful design to guide the user to the proper selection and a mouse pointer that can be moved directly to the appropriate target is a highly effective use of this interface.

With this in mind, I started thinking about the use of both interfaces in the Realeyes IDS.

The installation is performed by executing several shell scripts on the command line. Many of the responses have a default, which allows the Enter key to be pressed for a majority of the responses. And where that is not possible, simple information, such as an IP address or a single name, is entered. To begin with, a graphical environment is not required. And second, there is no switching between the keyboard and the mouse. This makes deploying a new sensor or updating the application very fast after the first couple of times.

In contrast, the analysis window displays large amounts of information that may not be selected in a sequential order. Having the data in a scrollable list allows me to focus on one column, while watching another with my peripheral vision. This allows me to see patterns that might not be apparent in a text display.

Another advantage of the GUI comes when I want to view the session data. This is done by right clicking on a report and selecting Playback from the popup menu displayed. When the reports are sorted by Event, all of those with the same Event are similar enough that a quick glance at the session data is sufficient to determine whether it warrants more attention. Then a single click closes the window. This means that I can often rip through reports, spending only 2 or 3 seconds on most of them.

The GUI also provides tabs that allow me to search for trends or get summary reports without losing my focus on the original display. And the status frames automatically notify me if there is something that needs attention, without my having to query for it.

There are some administrative functions that could be performed from the command line as easily as the GUI, and in the early versions of the Realeyes IDS, it was the only way to do that. However, having them incorporated into the GUI is very convenient.

The downside of this is a lack of flexibility. In order for a capability to be available, there must be code in the GUI application. The command line gives an administrator complete control of maintenance procedures, and under certain circumstances, this is the only option.

From a design perspective, the choice of command line vs. GUI seems pretty straightforward. First, how quickly does the code need to be produced? Second, which interface makes the user most productive? While there is plenty of room for different points of view on the answers to these questions, it is simply not true that one is always better than the other.

Later . . . Jim

Tuesday, March 17, 2009

Database Security

I saw the following on the webappsec list at Security Focus:

| I've heard this preached before.
|
| Using JDBC properly can help protect against SQL Injection.
|
| What protections does JDBC provide?
|
| Does java encode the input to not be malicious?
|
| I'm curious where in the java source/libraries does jdbc help
| to mitigate malicious input when using jdbc.
|
This preach is applicable for any programming language. It
all depends on how well you have done input & output
validation. As in what input you expect & what input is
malicious for your app. If all goes well you can make SQL
injection very difficult or even impossible . The reason I
say difficult, because it all depends on how well the SQL
injection is crafted. As far as I recollect I don't think
JDBC or for that case even java gives you predefined class
for doing that. But there is quite a possibility that some
one on the internet must have surely written these classes.
--
Taufiq
http://www.niiconsulting.com/products/iso_toolkit.html

I don't disagree with Taufiq's assessment. However, I do disagree with his acceptance of the status quo. I wrote a rant on this blog responding to a complaint that security professionals are not taken seriously. In it, I pointed out that the security industry should promote improving the security climate, not just react to it with solutions 'for a price'. The example I gave was *DBC libraries.

The JDBC package, java.sql, does not supply any security parsing. It is not the real workhorse, but it should at least define a method for this. Each database supplies a jar that the java.sql classes call to access that specific database, and this is where security parsing must be handled.

The thing is that parsing input is tricky. The first step is to validate that the input is correct for the column data type. This is reasonably straightforward for simple types like integer and varchar. But the way different databases support binary data and very large fields is not consistent. There is also support for non-standard data, such as PostgreSQL's support for the inet data type.

The JDBC Connection interface includes the getMetaData method, which returns the information supplied by the specific database library, some of which is unique to that database. There are not only differences between databases, it is even possible that there are differences between versions of the same database. This could be an issue for an application because:

    Some DatabaseMetaData methods return lists of information in the form of ResultSet objects. Regular ResultSet methods, such as getString and getInt, can be used to retrieve the data from these ResultSet objects.
All unique information must be verified for every version of the database supported. And if you are supporting multiple databases, it is that much more difficult.

The next step is to escape all characters that have special meaning, such as single quote and backslash. But again, each database has its own special characters that must be accounted for, such as ampersand in Oracle, and the E'...' escape pattern in PostgreSQL.
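The most basic of these rules, doubling single quotes, can be sketched in C. This is a toy example only; it handles just the quote character and is no substitute for per-database escaping or parameterized queries:

```c
#include <stddef.h>

/* Copy 'src' into 'dst', doubling every single quote for SQL.
   'dstlen' is the size of dst; returns the length written,
   or -1 if dst is too small. */
int sql_escape_quotes(char *dst, size_t dstlen, const char *src)
{
    size_t o = 0;
    for (; *src != '\0'; src++)
    {
        if (o + 2 >= dstlen)      /* room for the char(s) plus the NUL */
            return -1;
        if (*src == '\'')
            dst[o++] = '\'';      /* '' is the standard SQL escape for ' */
        dst[o++] = *src;
    }
    dst[o] = '\0';
    return (int) o;
}
```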

Update: Eric Kerin points out in his comments that the PreparedStatement interface does this, and after some testing I have found that this is the case. My excuse is that there is nothing in the javadoc for the SQL package or the PreparedStatement interface that explains this. Instead the documentation promotes it for optimizing frequently used statements. See my reply below for further responses to comments.

Also, there is a good article on this issue at the Open Web Application Security Project, which I found by googling for java and "sql injection".


The current situation places the responsibility for security on the thousands of application programmers, who must now dig into the internals of the database(s!) on the backend of their applications. If instead, the database development teams provided a parser for each field of data, it would be possible to determine if the input would result in a message, like this one that I was able to create from testing various input sequences:

    WARNING: nonstandard use of \' in a string literal

    ERROR: relation "foobar" does not exist

    STATEMENT: select foo from foobar

I'm still working on parsing that construct and reworking it in a way that does not reject the data out of hand, because it might be a legitimate description of an Event, or possibly a Trigger definition. I am fortunate, because the input is not read from the network. You might not be so lucky.

And before I leave this topic/rant, I must point out that application programmers need to work closely with their DBAs to be sure that permissions are set on tables to allow only as much access as absolutely necessary and no more. If you don't have a DBA and/or maintain the database yourself, you need to become very familiar with the levels of GRANTing access and the use of roles to at least limit the damage when SQL injection attacks succeed. In my own experience, as well as reports from others, the attacks on applications, and databases especially, are continuing to increase.

If anyone is interested in the database security in the Realeyes UI, check out the Database and ValidatorDBForm modules, and then see how they are used in any of the Window*Base.java modules at the UI subversion repository. The ValidatorDBForm class includes the InterfaceAdminExtendedProcessing interface to do extra contextual error checking, which really is the job of the application. There are some pretty good examples of its use in the WindowsRules(Triggers/Actions/Events).java modules.

I'm pretty sure I'm talking for all application developers when I say (as a security guy), "Hey database developers, a little help!"

Later . . . Jim

Saturday, March 14, 2009

Java Search

I think I'm finally getting the hang of Object Oriented programming. I have been working on the user interface to provide all administration from it and add quite a bit of usability.

Over the past few days, I added search to the playback window. Because the playback window has two frames (Text classes in Java), this is a bit trickier than your standard text search. To begin with, I am allowing the search to be limited to one or the other frame, as well as using both. This means that the search class has to be aware of each frame.

To be able to highlight the text, the Text class actually needs to be defined as StyledText. When the text is found, the replaceStyleRanges method is called to highlight it. For now, I am leaving it highlighted, thinking that it is more helpful to be able to see all of the found selections. The current found text is displayed in reverse video by using the setSelection method. This one has to be reset by setting the selection range to zero before setting the new selection.

I thought about being able to use a single find window to search multiple playback windows, but this made my head hurt. However, it did seem friendly to share the search strings between playback windows. So I created a string array in the global variables class, and store the strings there. I even save them to the preference store so that they are maintained over application restarts.

If you are interested in this code, check out the WindowPlayback*.java source modules in the subversion repository. The preference store is defined and initialized in the Globals.java and MainWindow.java modules.

The beauty of the OOP style is that almost all of my code is spent managing indexes. The heavy lifting is done by other classes and their methods. So I am hopeful that these GUI enhancements will be finished pretty soon, and I will build another set of download packages by the end of the month.

Later . . . Jim

Monday, February 23, 2009

Whatcha Doin'?

As soon as I put the latest download on SourceForge, I started working on the user interface. I am hoping to bring it up to a version 1.0 level of usability. At the rate I am going, I expect to have a download ready in about a month.

So that's what's coming. But what I barely mentioned, was a new feature in the last package that I added a week or so before it was built. The reason was, I wanted to see how well it worked in the pilot project before explaining it. But now I have done that, and I am pretty happy with it.

The feature is a comparison of the client and server session size. When a TCP session ends, the client data is multiplied by a factor, and if that is larger than the server data, it is reported. The multipliers are specified by port, so each protocol can be handled uniquely.
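The check can be sketched like this (my own illustration with hypothetical names, not the actual Realeyes code): a per-port multiplier table consulted when a TCP session ends.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionSizeCheck {
    // Per-port multipliers; ports without an entry are not checked,
    // so each protocol can be handled uniquely.
    private final Map<Integer, Double> multipliers = new HashMap<>();

    public void setMultiplier(int port, double factor) {
        multipliers.put(port, factor);
    }

    // At session end: report if the client data, scaled by the port's
    // multiplier, exceeds the server data.
    public boolean shouldReport(int serverPort, long clientBytes, long serverBytes) {
        Double factor = multipliers.get(serverPort);
        if (factor == null) {
            return false; // no multiplier configured for this protocol
        }
        return clientBytes * factor > serverBytes;
    }
}
```

With a multiplier of 1 on port 80, as in the pilot, any session where the client sent more than it received gets reported.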

In the pilot, I have set port 80 to have a multiplier of 1, just as a proof of concept. There have not been too many reports on it, and those can be grouped in three categories:
  • Document uploads: This was what I was hoping to see. And I believe that it means if there is an exploit that loads a rootkit, that would be reported, as well.

  • Large cookies: I never realized how much is saved in some cookies. I have seen several (Facebook, I'm looking at you!) that are over 4Kbytes.

  • Miscellaneous: The rest of the reports are usually client requests for many files, but some of them don't get sent for some reason.
Again, this is mostly a proof of concept feature and I hope to expand on it down the road. But it gives me the sense that the Realeyes IDS is capable of detecting behavior. I think that's pretty cool.

Later . . . Jim

Friday, December 5, 2008

Punishment vs. Prevention

Punishment

Recently, F-Secure released a report titled, "Growth in Internet crime calls for growth in punishment". The article and the associated report cite F-Secure's research and several specific incidents to make the case for creating an 'Internetpol' to fight cybercrime. It is their conclusion that "against a background of steeply increasing Internet crime, the obvious inefficiency of the international and national authorities in catching, prosecuting and sentencing Internet criminals is a problem that needs to be solved."

The data that is used to reach this conclusion is tenuous at best. The primary fact cited is that the number of signatures in the F-Secure detection database has tripled in the past year. This could be explained in many ways, one of the main ones being that exploit creators have adapted to signature-based detection by automatically generating variations of the original, which requires many more signatures to detect a single basic exploit. Numbers alone do not tell the story.

As a side note, this is one of the strengths of the Realeyes IDS. While the rules include specific characters to be matched, they can be detected in any order and then correlated with an Action. At the next level, multiple Actions can be correlated with an Event. This allows many variations of an exploit to be defined by a single Event rule.
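A toy version of that layering, with hypothetical names and none of the real engine's streaming state machine, might look like:

```java
import java.util.List;
import java.util.Set;

public class RuleCorrelation {
    // An Action fires when all of its character strings appear in the
    // data, in any order.
    public static boolean actionMatches(String data, Set<String> tokens) {
        return tokens.stream().allMatch(data::contains);
    }

    // An Event fires when every one of its Actions has matched, so a
    // single Event rule covers many variations of the same exploit.
    public static boolean eventMatches(String data, List<Set<String>> actions) {
        return actions.stream().allMatch(action -> actionMatches(data, action));
    }
}
```

Because ordering is not part of the match, reshuffled variants of an exploit do not each need their own signature.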

F-Secure's anecdotal evidence of outbreaks is even less convincing. It is just as easy to conclude that attacks are more targeted than a few years ago, when a single worm could infect millions of systems, and infer from this that software development has become at least good enough to deter the easy attacks. Yet neither scenario is absolutely supported by the evidence.

But even if the problem were defined correctly, the solution presented is not. First and foremost, what is a cybercrime in international terms? Most countries have not updated their own laws to meet the conditions presented by the Internet. The thought that Brazil, Russia, India, China, the UK, the US, and all the other countries with Internet access could agree on a common set of laws to govern Internet usage is a stretch, to say the least.

Then there is the issue of prosecution. The situation of a perpetrator who is in a different country from the computers attacked would probably not be any different from how that is handled today. And it is all too common for bureaucratic agencies to use quantity instead of quality to prove success. This initiative would very likely result in many low-level 'criminals' and even some innocent people being swept up in the new dragnet.

Finally, I find it extremely simplistic to suggest dumping society's problems on law enforcement. A huge question is how this internetpol organization would be staffed, especially considering that existing law enforcement agencies are finding it challenging to enforce existing laws in the Internet environment. Between jurisdictional issues and competition for the qualified candidates, the new agency would certainly create inefficiencies. And where is the funding to come from?

Prevention

I believe that legislatures need to update laws to define what constitutes cybercrime. The recent case on cyberbullying has produced potentially bad precedents that need to be addressed, and soon. But most of this effort should focus on adapting current law to the Internet, and only creating new laws where they are justified by a unique situation.

The truth is, much of the problem is technological. SQL injection attacks are an example. Currently, every application programmer is expected to parse input for this. But many application programmers hardly know what a database is, much less how to protect against all the possible variations of SQL injection. The ones who do know that are the database developers. Therefore, the security community should be calling for all xDBC libraries to include methods to validate input for applications.
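As an illustration of the kind of library-supplied helper being argued for, here is a hypothetical validator (my own sketch with invented names; parameterized queries remain the stronger defense):

```java
public class SqlInputGuard {
    // A library-provided check of the kind argued for above: reject
    // input containing quote/comment/statement metacharacters, rather
    // than expecting every application programmer to know them all.
    public static boolean isSafeLiteral(String input) {
        String[] metacharacters = {"'", "\"", ";", "--", "/*", "*/", "\\"};
        for (String bad : metacharacters) {
            if (input.contains(bad)) {
                return false;
            }
        }
        return true;
    }
}
```

The point is where the knowledge lives: the list of dangerous sequences is maintained by the database developers who know them, not rediscovered by each application.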

The F-Secure report cited botnets as one of the primary security concerns. The root cause of botnets is spam email. If this were not such a lucrative business, it would not be such a problem. One of the solutions is to force strong authentication in email protocols. And this is just one example. The security community should support an organization that could act as consultants to protocol committees to define strong security solutions for Internet protocols. That organization could also focus on convincing vendors and users to implement those solutions.

There are many guides on secure programming, but how many application developers have studied them? This should be mandatory, because if exploiting vulnerabilities were hard, there would not be nearly as many attacks. The security community could help produce more secure applications by establishing a certification program for secure programming.

Realistically however, the biggest part of the problem is unaware users. We in the industry talk about best practices, but that is meaningless to most users. We need to convince management to ensure that users get adequate training about good security practices and we need to be specific about what that training includes.

Finally, I feel compelled to issue the warning, "Be careful what you wish for, because you might just get it." If the government takes over Internet security, there is sure to be a large amount of new regulation imposed. And this could mean security companies like F-Secure would have to devote a lot of resources towards compliance. I think it would be much better for us to take responsibility for finding solutions ourselves.

Later . . . Jim

Tuesday, September 23, 2008

My App Fails the LSB

My position on the Linux Standard Base has evolved. When I first heard about it, I was all for it. The LSB as a standard could be useful to some, but now I disagree with the goals of the LSB working group. To be sure, this post is not about dissing the Linux Foundation. They have many worthwhile projects.

What follows is my experience with the LSB Application Checker, my take on the purpose of the LSB, and my own suggested solution for installing applications on GNU/Linux distributions. The Realeyes application failed to certify using the checker v2.0.3, which certifies against the LSB v3.2. Everything that it called out could be changed to pass the tests, but I will only consider correcting a few of the 'errors'.

After building the Realeyes v0.9.3 release, I collected all executable files in a common directory tree, downloaded the LSB application checker, and untarred it. The instructions say to run the Perl script, app-checker-start.pl, and a browser window should open. The browser window did not open, but a message was issued saying that I should connect to http://myhost:8889. This did work, and I was presented with the Application Check screen.

There was a text box to enter my application name for messages and one to enter the files to be tested. Fortunately, there was a button to select the files, and when I clicked on it a window opened that let me browse my file system to find the directories where the files were located. For each file to be tested, I clicked on the checkbox next to it, and was able to select all of the files, even though they were not all in the same directory. Then I clicked on the Finish button and all 87 of the selected files were displayed in the file list window.

When I clicked on the Run Test button, a list of about a dozen tasks was displayed. Each was highlighted as the test progressed. This took less than a minute. Then the results were displayed.

There were four tabs on the results page:
  • Distribution Compatibility: There were 27 GNU/Linux distributions checked, including 2 versions of Debian, 4 of Ubuntu, 3 of openSUSE, 3 of Fedora, etc. Realeyes passed with warnings on 14 and failed on the rest.

  • Required Libraries: These are the external libraries required by the programs written in C. There were nine for Realeyes, and three (libcrypto, libssl, and libpcap) are not allowed by the LSB. This means that distros are not required to include these libraries in a basic install, so they are not guaranteed to be available.

  • Required interfaces: These are the calls to functions in the libraries. There were almost a thousand in all, and the interfaces in the libraries not allowed by the LSB were called out.

  • LSB Certification: This is the meat of the report and is described in some detail below.

The test summary gives an overview of the issues:
  • Incorrect program loader: Failures = 11

  • Non-LSB library used: Failures = 4

  • Non-LSB interface used: Failures = 60

  • Bashism used in shell script: Failures = 21

  • Non-LSB command used: Failures = 53

  • Parse error: Failures = 5

  • Other: Failures = 53

The C executables were built on a Debian Etch system and use /lib/ld-linux.so.2 instead of /lib/ld-lsb.so.3. The non-LSB libraries and interfaces were described above, but there was an additional one. The bashisms were all a case of either using the 'source' built-in command or using an arithmetic test like:
  while (( $PORT < 0 )) || (( 65535 < $PORT )); do

The '(( ... ))' construct is a bash extension; in other Bourne shells the portable equivalent uses the test command:
  while [ "$PORT" -lt 0 ] || [ 65535 -lt "$PORT" ]; do

The parse errors were from using the OR ("||") symbol.

The fixes for these are:
  • Use the recommended loader

  • Statically link the Non-LSB libraries

  • Use '.' instead of 'source'

  • Rework the numeric test and OR condition

So far, all of this is doable, sort of. But every time a statically linked library is updated, the app must be rebuilt and updates sent out. Also, the additional non-LSB library is used by another library. So I would have to build that second library myself and statically link the non-LSB one (which happens to be part of Xorg) into it (and the second library is part of GTK). The reason I am using it is that the user interface is built on the Eclipse SWT classes, which use the local graphics system calls to build widgets.

The non-LSB commands include several Debian specific commands (such as adduser), and for my source packages I had to rework the scripts to allow for alternatives (such as useradd). But the other disallowed commands are:
  • free: To display the amount of free memory

  • sysctl: To set system values based on the available memory

  • scp: Apparently the whole SSL issue is a can of worms

  • java

  • psql

Finally, all of the other 'errors' were of the form "Failed to determine the file type", for the following file types:
  • JAR

  • AWK

  • SQL

  • DTD and XML

Part of the problem with the LSB is that it has bitten off more than it can chew. Apparently Java apps are non-LSB compliant. So are apps written in PHP, Ruby, Erlang, Lisp, BASIC, Smalltalk, Tcl, Forth, REXX, S-Lang, Prolog, Awk, ... From my reading, Perl and Python are the only non-compiled languages that are supported by the LSB, but I don't know what that really means (although I heartily recommend to the Application Checker developers that they test all of the executables in their own app ;-). And I suspect that apps written in certain compiled languages, such as Pascal or Haskell, will run into many non-LSB library issues.

Then there are databases. Realeyes uses PostgreSQL, and provides scripts to build and maintain the schema. Because of changes at version 8, some of these scripts (roles defining table authorizations) only work for version 8.0+. The LSB Application Checker cannot give me a guarantee that these will work on all supported distros because it didn't test them. I have heard that there is some consideration being given to MySQL, but from what I can tell, it is only about certifying MySQL itself, not scripts to build a schema in MySQL.

After all this kvetching, I have to say that the Application Checker application is very well written. It works pretty much as advertised, it is fairly intuitive, and it provides enough information to resolve the issues reported by tests. My question is, "Why is so much effort being put into this when almost no one is using it?"

An argument can be made that the LSB helps keep the distros from becoming too different from each other, and that without the promise of certified apps, the distros would not be motivated to become compliant. But I only see about a dozen distros on the list, with Debian noticeably absent. And yet, there is no more sign of fragmentation in the GNU/Linux world than there ever was.

My theory on why UNIX fragmented is that proprietary licenses prevented the sharing of information which led to major differences in the libraries, in spite of POSIX and other efforts to provide a common framework. In the GNU/Linux world, what reduces fragmentation is the GPL and other FOSS licenses, not the LSB. All distros are using most of the same libraries, and the differences in versions are not nearly as significant as every UNIX having libraries written from scratch.

I have to confess, I couldn't care less whether Realeyes is LSB compliant, because it is licensed under the GPL. Any distro that would like to package it is welcome. In fact, I will help them. That resolves all of the dependency issues.

While I am not a conspiracy theorist, I do believe in the law of unintended consequences. And I have a nagging feeling that the LSB could actually be detrimental to GNU/Linux. The only apps that benefit from LSB compliance are proprietary apps. The theory behind being LSB compliant is that proprietary apps can be guaranteed a successful installation on any LSB compliant GNU/Linux distro. I'm not arguing against proprietary apps. If a company can successfully sell them for GNU/Linux distros, more power to them. However, what if proprietary libraries manage to sneak in? This is where the biggest threat of fragmentation comes from.

But even more importantly, one of the most wonderful features of GNU/Linux distros is updates, especially security updates. They are all available from the same source, using the same package manager, with automatic notifications. If the LSB is successful, the result is an end run around package managers, and users get to deal with updates in the Balkanized way of other operating systems. That is a step in the wrong direction.

The right direction is to embrace and support the existing distro ecosystems. There should be a way for application teams to package their own apps for multiple distros, with repositories for all participating distros. The packages would be supported by the application development team, but would be as straightforward to install and update as distro supported packages.

There is such a utility, developed by the folks who created CUPS. It is called the ESP Package Manager. It claims to create packages for AIX, Debian GNU/Linux, FreeBSD, HP-UX, IRIX, Mac OS X, NetBSD, OpenBSD, Red Hat Linux, Slackware Linux, Solaris, and Tru64 UNIX. If the effort that has gone into LSB certification were put into this project or one like it, applications could be packaged for dozens of distros.

And these would not just be proprietary apps. There are many FOSS apps that don't get packaged by distros for various reasons, and they could be more widely distributed. Since the distros would get more apps without having to devote resources to building packages, they should be motivated to at least cooperate with the project. And don't forget the notification and availability of updates.

As a developer and longtime user of GNU/Linux (since '95), I believe that all of the attempts to create a universal installer for GNU/Linux distros are misguided and should be discouraged. I say to developers, users, and the LSB working group, "Please use the package managers. A lot of effort has been put into making them the best at what they do."

Later . . . Jim

Sunday, September 21, 2008

Realeyes IDS v0.9.3 Ready for Download

The packages for Realeyes IDS release 0.9.3 are ready for download. This version improves stability over the previous one and includes several new features.

If there are any problems, please notify the team. Thanks.

You may have noticed that I finally finished the First Edition of the Realeyes IDS Manual. I made it as useful as I could. But any comments are welcome.

Later . . . Jim

Saturday, August 30, 2008

Building a Debian GNU/Linux package

I recently built packages for the 0.9.3 release of Realeyes. These include both source and Debian GNU/Linux packages. For reasons that I have forgotten, I built the Debian packages before I started working on the source packages, but I'm glad that I did. Going through that process made me add several things that I would have overlooked, especially man pages.

I actually built an entire distribution as part of the process of creating my packages. It does not meet the requirements for re-distribution, but is very handy for laying down a fresh install with exactly what is needed to run Realeyes. I will provide the steps for that in another post.

The source is comparatively easy: once all of the files are collected, a tar file is created from the directory. The trick here is to verify that the C code will compile. I have read a lot of comments about how straightforward the standard configure/make installation procedure is. But from a developer's perspective, there are several issues, mainly having to do with autoconf and automake. I don't have enough space here to discuss them, and I wouldn't have much wisdom to impart if I did, because I have only learned as much as I needed for my own packages.

Debian packaging is somewhat more complicated. A package that is included in a Debian distribution must go through fairly rigorous tests. There are several scripts for checking correct packaging procedures, including lintian and linda. These verify such conditions as:
  • executables are built correctly

  • man pages exist for every executable file

  • naming conventions are followed

  • Debian documentation is formatted correctly
I have seen a few HOWTOs on building a Debian package, and there is a huge amount of information on the Debian site. But I still had to piece together a working plan for myself, so I thought it would be worth sharing it. There may be better ways to do some of it, but I have written scripts that get me through the process with only a little manual effort. And even when there is an automated process, it is still good to know what is happening under the hood. So without further ado, I offer my experience.

Building a Debian Package


I. Read enough of the manuals to get a sense how Debian packages are built, and then keep links to them for reference:

II. Create a working directory for the package

  • Get a package to use as a model, using the following commands to extract the package files and the Debian metadata files:

    apt-get -d -y --reinstall install package_name
    dpkg -x package_name.deb package_dir
    cd package_dir
    dpkg -e ../package_name.deb

  • Create a working directory, and under it make the directories to be installed for the package, even if there are no files to be saved in them. These may include:

    working_dir/etc/package_name
    working_dir/etc/init.d
    working_dir/usr/sbin
    working_dir/usr/share/package_name
    working_dir/usr/share/doc/package_name
    working_dir/usr/share/man
    working_dir/var/log/package_name

  • Make the Debian control directory with its files (as needed)

    working_dir/DEBIAN

    • control: This contains the description of the package, including dependencies, architecture, and the description used by aptitude or synaptic -- use the model to create this for the first time

    • conffiles: This contains any configuration files installed with the package -- I put mine in /etc/realeyes

    • preinst: This is a shell script that runs before the package is installed, if it exists

    • postinst: This is a shell script that runs after the package is installed, if it exists -- I use it to create user IDs

    • prerm: This is a shell script that runs before the package is de-installed, if it exists

    • postrm: This is a shell script that runs after the package is de-installed, if it exists

  • Populate the directories with the application files: Use the model to help understand what goes where

III. Use the maintainer tools to verify package acceptability
  • Build the package:
    cd working_dir
    dpkg-deb --build package_dir package_name
  • lintian/linda: Check for package discrepancies. Note that lintian and linda are not in the standard install and must be installed separately. Lintian uses Perl and linda uses Python, so several dependencies may be installed with them.
    lintian -i package_name > package_name.lintian
    linda package_name > package_name.linda
  • Fix all the problems, here are some helpful hints from my experience:

    • Man pages: txt2man is a program that takes ASCII text and converts it to a man page. It works for simple pages, but the resulting groff file may have to be edited manually in some cases. Use 'gzip -9' to compress man pages.

    • Compiled programs: Compiled programs must be stripped. Use the command:
      install -s
    • Identify all non-executable files in system directories (i.e., /etc/package_name) in the package's DEBIAN/conffiles.

    • Lintian provides the section in the Debian Policy manual that describes the requirement that was flagged.

  • Sign the package: Create a GPG key for the package and sign each package with the key
    gpg --gen-key
    dpkg-sig -s builder pkg.deb
    NOTES:

    • The public keyring is in $HOME/.gnupg/pubring.gpg

    • There should be a lot of entropy on the system to help the random number generator, (grep -R abc /usr/* seems to work well)

    • Issue the command 'cat /proc/sys/kernel/random/entropy_avail' to find out how much entropy is currently available; it should be above 1,500
At this point, the package can be installed using the dpkg command. However, if there are dependencies that must be installed, dpkg will issue a warning about them, but does not handle their installation. So if you want to go to the next level, here is what you have to do.

IV. Repository directories

A Debian repository has a relatively simple directory structure to maintain the packages and metadata about them. An installation ISO is basically a repository tree with just the stable packages. A custom repository tree can be added to the apt sources.list to be accessed just like officially maintained packages, with aptitude or synaptic.

In the top repository directory, the following are mandatory:
  • md5sum.txt: The list of all files in the tree with their md5 checksums

    To create the md5sum.txt file for my mini-distro, I wrote a script that ran in the top ISO directory. It did a recursive ls, ran md5sum on all regular files, and wrote the output to the md5sum.txt file. I keep that as a template and only update the files that change.

  • pool: The subdirectory where packages are kept. Under the pool directory, there are a few pre-defined directories where the different categories of packages are kept. Anyone who has edited a sources.list file has seen most of these:

    • main: Technically, these are packages that meet the Debian Free Software Guidelines (DFSG), but I think of them as the officially maintained packages

    • contrib: Contributed packages are DFSG, but depend on packages that are not -- I use this for my own packages, even though I don't have any non-DFSG dependencies

    • non-free: These are non-DFSG packages

    The structure of each of these is the same. The package directories are in subdirectories named with the first letter of the package name. The exception is libraries, which are under directories named libletter, which is the prefix of the library package name. Below these subdirectories are the directories with the actual package files.

  • dists: The dists directory contains the metadata about packages. There is a directory for the distribution, in this case, etch. In an ISO, there are also the directories, frozen, stable, testing, and unstable, which are links to the distribution directory. In a repository, these may have their own files. But for my purposes, I only include the distribution subdirectory.

    Under the distribution subdirectory are the following:

    • Release: This file describes the packages, including the architecture, the components, and contains md5sums for the package metadata files -- the file information also includes the file size, and since there are only a few of these, I created the original by hand

    • main: This directory contains the metadata about the main packages

    • contrib: This directory contains the metadata about the contrib packages

    The structure of main and contrib is the same, and again, I only use contrib. The architecture directories are below contrib, and in my case, there is only binary-i386. In the architecture directory there are three files:

    • Packages: This uses information from the DEBIAN/control file and adds such things as the full path of the package file

    • Packages.gz

    • Release: This contains metadata about the contrib directory

  • I also put a few optional files in the top level directory. These include a copy of the GPL (I use version 3), installation instructions, and the public key for the signed packages. The installation instructions explain how to add the public key so that aptitude and synaptic can validate the packages.
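The pool layout described above can be captured in a small function (my own illustration of the naming convention, not a Debian tool):

```java
public class PoolPath {
    // Debian pool layout: a package lives under a directory named for
    // its first letter, except libraries, which go under "lib" plus the
    // next letter of the package name (e.g. libssl -> libs/libssl).
    public static String poolSubdir(String pkg) {
        if (pkg.startsWith("lib") && pkg.length() > 3) {
            return "lib" + pkg.charAt(3) + "/" + pkg;
        }
        return pkg.charAt(0) + "/" + pkg;
    }
}
```

So a Realeyes component lands under pool/contrib/r/, matching the copy step in the next section.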

V. Add the packages
  • Copy the packages to the appropriate pool directory. In my case, this means copying them to:

      iso_dir/pool/contrib/r/realeyescomponent

  • Create an override file for the packages. This consists of a line for each package with the following information:

      package priority section

    In my case it looks like this:
    realeyesDB   optional  net
    realeyesDBD optional net
    realeyesGUI optional net
    realeyesIDS optional net
    The man page for dpkg-scanpackages (the next command) has a description of each field and says that the override file for official packages is in the indices directory on Debian mirrors.

  • Build the metadata:
    dpkg-scanpackages \
    pool/contrib/ override.etch.contrib > \
    dists/etch/contrib/binary-i386/Packages
    cd dists/etch/contrib/binary-i386
    gzip -c -9 Packages > Packages.gz
    cd ../..
    md5sum contrib/binary-i386/Packages* > md5.tmp
    ls -l contrib/binary-i386/Packages* >> md5.tmp
  • The file md5.tmp is edited to put the file size after the md5 checksum and before the file name, and the file-listing lines are deleted. Then the file Release is edited to append the contents of md5.tmp at the end, and the duplicate lines are deleted.
    gpg --sign -ba -o Release.gpg Release
    cd ../..
    cp -a ~/.gnupg/pubring.gpg RE_pubring.gpg
    md5sum ./INSTALL* > md5sum.txt
    md5sum ./GPL* >> md5sum.txt
    md5sum ./dists/etch/Release >> md5sum.txt
    md5sum ./dists/etch/contrib/binary-i386/* >> md5sum.txt
    md5sum ./pool/contrib/r/*/* >> md5sum.txt

VI. Installing the package

The instructions for installing this package are:
  • Copy the debian packages to a directory that will be used for the initial installation and future updates, such as, /var/tmp/realeyes. Untar the packages:
    tar xvzf realeyes_debian.tar.gz
  • Change to the top level packages directory and add the public key file, RE_pubring.gpg, to the Debian trusted sources with the command:
    apt-key add RE_pubring.gpg
  • Edit the file /etc/apt/sources.list to add the line:
    deb file://install_dir/realeyes/ etch contrib
  • Update the package lists with one of the following methods:

    • On the command line enter:
      apt-get update
    • In aptitude, select Actions -> Update package list

  • Install the package using aptitude or synaptic
So there you have it. I hope it shortcuts the learning curve a little.

Later . . . Jim

Friday, August 22, 2008

Messages

I recently saw a question on Slashdot asking how to handle application messages. Most of the responses were along the lines of "only output what is important". Of course that implies that the programmer knows what is important to every user, which isn't very likely.

In Realeyes, the approach is different. First, there is a message for just about everything except normal data collection and analysis. This includes parsing configuration files, network connection activity, administrative commands, and, of course, errors. Each message is assigned a type code, such as critical, error, warning, or informational. Then, in the main configuration file, warnings and informational messages must be explicitly requested before they are logged.

This way, a newcomer to the application can enable all messages to get a sense of whether and how the program works. When the repetitious messages are no longer useful, they can be turned off. Although the warnings may be helpful in troubleshooting, they generally report configuration issues that are already known and could become a bit irritating, so the user is given the choice of recording them or not.

Of course, error messages are not optional. And there is a NOTE message type, that is not optional, used for things like the startup and shutdown messages.
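That logging policy can be sketched as follows (an illustration with hypothetical names, not the actual rae_control.c logic): errors, critical messages, and NOTEs are always on; warnings and informational messages are opt-in.

```java
import java.util.EnumSet;
import java.util.Set;

public class MessageLog {
    public enum Type { CRITICAL, ERROR, WARNING, INFO, NOTE }

    // CRITICAL, ERROR, and NOTE are not optional; WARNING and INFO are
    // logged only if explicitly requested in the configuration.
    private final Set<Type> enabled =
        EnumSet.of(Type.CRITICAL, Type.ERROR, Type.NOTE);

    public void enable(Type type) {
        enabled.add(type);
    }

    public boolean shouldLog(Type type) {
        return enabled.contains(type);
    }
}
```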

If you are interested in seeing how this is coded, check out the files, RealeyesAE/src/rae_control.c and RealeyesAE/include/rae_messages.h, in the Realeyes subversion repository. In rae_control.c, the function, control_messages, handles writing the log file. And down about line 380, there is a message.

This brings up a couple of points. First, the application is actually a collection of processes. The child processes do not write messages to disk. Instead, they create the message in shared memory and put it on a queue using the macro, raeMESSAGE, which is defined in rae_messages.h. The parent process (called the manager) periodically checks the message queues and writes messages to the log files. It also has a shutdown function that is called even after a system interrupt, which prints messages one last time.
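The queue-and-drain pattern can be sketched in Java (the real implementation is C with a shared-memory queue filled by the raeMESSAGE macro; this is only an analogy with hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MessageQueue {
    // Workers enqueue formatted messages instead of writing to disk.
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();

    public void post(String msg) {
        queue.add(msg);
    }

    // The manager periodically drains the queue and returns the batch
    // to be written to the log file; the same drain runs one last time
    // from the shutdown handler so no queued messages are lost.
    public List<String> flush() {
        List<String> batch = new ArrayList<>();
        String msg;
        while ((msg = queue.poll()) != null) {
            batch.add(msg);
        }
        return batch;
    }
}
```

Keeping disk I/O in one process avoids interleaved log writes and keeps the children on the fast path.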

Second, the message documentation is in the source code. I use doxygen to create program documentation. I found a way to create separate files for inline documentation and that is how the message logs are created. This makes it a lot easier to keep messages up to date, and to be sure that all messages are documented. Unfortunately, Dmitri did something after version 1.3.6 of doxygen, so that this no longer works. For now, I keep an older version of doxygen around, but eventually, I plan to create a script to handle it. In that, I would like to output messages in both HTML and ODF.

If this has been of interest and you have any tips for making maintenance of applications more efficient, please share.

Later . . . Jim

Monday, August 4, 2008

Program Security

Good security practices are multi-layered. The levels that are addressed in the Realeyes application are:
  • code vulnerabilities

  • program interaction

  • privileges and access

Code vulnerabilities are bugs that can be exploited to gain control of a program simply by interacting with it. Therefore secure programming starts with good coding practices. I saw David A. Wheeler present his HOWTO on secure programming practices, and highly recommend it. He covers issues in reading and writing files, how to prevent buffer overflows, user privileges, and much more. I was glad to have seen his presentation early in the design phase of the Realeyes project.

The only problem with his, and almost every other, tutorial/reference I have ever read is that it covers so much ground that each individual topic is a little short on detail. Even my favorite reference books by W. Richard Stevens leave a lot of code as an exercise for the reader. I am not going to write a tutorial, but the ways that issues specific to the Realeyes application have been handled serve as detailed implementation examples. Where files are referenced, each file path is relative to the subversion repository for Realeyes. And BTW, I have a pretty good background in this, but I am sure that there are others who have more than I do, so I would be happy to hear any suggestions for improvements.

There are four components in the Realeyes application: the IDS, the database, the user interface, and the IDS-database interface (database daemon or DBD). Each has unique security issues, including interacting with other components.

The database is where user IDs are maintained. In the PostgreSQL database, it is possible to create groups that are granted specific access rights to each table. Then each user is assigned to a group and inherits those rights. The groups in the Realeyes database schema are defined in RealeyesDB/sql_roles/create_roles.sql and include:

  • realeyesdb_dbd: Only used by the DBD program to insert data from the IDS sensors and retrieve commands and new rules to be sent to the sensors

  • realeyesdb_analyst_ro: An analyst-read-only can view data and rules, and produce reports

  • realeyesdb_analyst: An analyst can view data and rules, create incident reports, and produce reports

  • realeyesdb_analyst_dr: An analyst-define-rules can do everything an analyst can, plus define new rules

  • realeyesdb_admin: An application administrator can do everything an analyst-define-rules can, plus create and modify new users and other application information

The user interface and DBD are written in Java, and connect to the database directly. The connection is always encrypted. So this layer of security is an administrative issue, to make the database host as secure as possible. The only additional feature offered here is the selection of a different listening port for the database than the default. The classes that interface with the database are in the files RealeyesDBD/DBD_Database.java and RealeyesGUI/Database.java.

One issue that was raised early in the pilot project I am running at a local college was how secure the captured data is. The data is stored in the database but not in raw form. The user interface reads this and reformats it, but does not print that to a file. An analyst could cut and paste the data, so that becomes a personnel issue. I have debated whether to provide the capability to write to file, but for the time being am leaning away from it. The user interface does generate summary reports and writes those to file, but that does not include any of the captured data.

The DBD connects to both the database and the IDS sensors. The IDS connection is optionally encrypted so that if it and the DBD are on the same host, the encryption overhead can be eliminated. Also, the address of the DBD host must be defined to the IDS sensor for the connection to be accepted, and the ports that are used are defined in the configuration. A sample configuration is in the file RealeyesDBD/sample_config/realeyesDBD.xml, and the code to parse it is in RealeyesDBD/RealeyesDBD.java.

One big issue I discovered regarding Java and encrypted connections is that, in JRE 1.4, it is possible to maintain multiple connections using the SocketChannel class, and it is possible to encrypt them using the SSLSocket class. However, it is not possible to do both at the same time. In JRE 1.5, the flaw is addressed, but the solution is very ugly. It essentially requires an application programmer to write a TCP/IP API. The argument for this is that Java might be used on networks other than TCP/IP, so the solution must be broad. Hopefully, JRE 1.6 will provide a solution for 99.99% of Java application programmers, and then the remaining 0.01% will still have a partial solution for their needs.

While I have considered porting the DBD to C++, the current code handles this in two ways. To begin with, there are two connections between the DBD and each IDS sensor. One is for data and the other is for control information. The data connection is handled by starting a thread that is dedicated to that connection, and that code is in the file RealeyesDBD/DBD_Handler.java. The control information is more sporadic, which is why I would have liked to use the SocketChannel selector. The workaround is to have the DBD poll each IDS sensor every 8 seconds if there has been no activity on the connection. The code for this is in the file RealeyesDBD/DBD_Control.java.

The IDS is a collection of C programs. These are started by running 'realeyesIDS', which spawns child processes. I discussed the reasons for choosing interprocess communication over threads in "Loose Threads". The main process, called the Manager, and all but one of the child processes run under the superuser ID. This is because they use a large shared memory buffer, and the way it is built, it can only be accessed by the superuser. The files that handle managing this buffer are:
  • RealeyesAE/src/rae_mem_ctl.c: Contains the code that the Manager uses to allocate, initialize, and do garbage collection for the buffer

  • RealeyesAE/src/rae_mem_mgmt.c: Contains the code that all processes use to allocate and free individual buffers

  • RealeyesAE/src/rae_lock_mgmt.c: Contains the code that all processes use to prevent memory locations from being changed incorrectly

The process that communicates with the DBD, called the Spooler, is designed with several security features. First, as the Spooler is started, it changes the user ID to one which has very limited access. It then changes the current directory to one that contains only the files it uses, and sets that to its root. The only communication from the Spooler to any of the other processes is through pipes, which means that it is serialized and straightforward to validate. Finally, the configuration file specifies the DBD host, and only connections from it are accepted.

The files that handle the Spooler communications are:
  • RealeyesIDS/data/rae_analysis.xml: The configuration file where the DBD host is defined (it is the manager's configuration file and contains a lot more, but the Spooler definitions are in their own section)

  • RealeyesIDS/src/rids_spooler.c: The Spooler initialization function where the user ID and directory are set is near the end of the spooler_init function

  • RealeyesIDS/src/rids_net_mgmt.c: The Spooler is the only process to use the network management functions, including the SSL setup, the listener which validates the connection request, and the exchange of data

Ultimately, the best way to program security is to think about how to exploit vulnerabilities in the code. And since the purpose of the Realeyes IDS is to detect exploits, I spend a lot of time thinking about it in general. So I have a fair amount of confidence that it is a good example of a securely coded network application.

Later . . . Jim

Saturday, July 26, 2008

Modularity

Realeyes was planned with the intention of supporting IPv6, and now that the basic functionality is in place and (mostly) working, I am adding full support for it. This means several things, including:

  • Deploying a Realeyes IDS sensor on an IPv6 network

  • Analysis of IPv6 packets by the IDS application

  • Inserting IPv6 addresses and data in the Realeyes database

  • Defining rules for IPv6 addresses

  • Displaying IPv6 addresses and headers from the user interface


I will describe this in more detail in a later post, but for the moment, I need a motivational boost, so I decided to give myself a pat on the back.

The way that IDS functionality is added is through plugins that perform specific functions. At this point, data collection and high level analysis are essentially complete. I am adding a few new features in the IDS only to the session handler and low level analysis plugins. In Realeyes terminology, this is the Stream Handler and the Stream Analyzers.

The Stream Handler parses the IP header to find the session ID and sets the location of the payload, such as TCP or UDP headers and their data. I set up two hosts on my local network for IPv6 connections. On a Linux system, this is as simple as issuing an ifconfig command on both systems and, for ease of use, adding the remote host to the /etc/hosts file:

    host100> ifconfig eth0 inet6 add fec0:0:0:1::100/64
    host100> # add to /etc/hosts:
      fec0:0:0:1::200 host200

    host200> ifconfig eth0 inet6 add fec0:0:0:1::200/64
    host200> # add to /etc/hosts:
      fec0:0:0:1::100 host100



Next, I established SSH and FTP sessions between them. I had the code to find the payload written, but this was the first time I had tested it. It took a couple of tries to get it right because the way IPv6 extension headers work is a bit tricky. But when I actually captured some sessions, they were displayed correctly in the user interface.

I then added the code to display the IPv6 headers in the user interface. This formats the main header and each extension header using human readable field names followed by the actual values. Because the header type of each extension header is in the previous header, this was also a little tricky to get working.

The IDS Stream Handler is also where IP fragments must be reassembled. I was really happy that after copying the IPv4 reassembly code and changing all instances of "v4" to "v6" and handling a couple of variables differently, the IPv6 reassembly worked. This is an example of the value of modularity and the use of variables in code.

As an aside, I learned this lesson in my first year of Computer Science. The grad student instructor had us program an assembler that could handle about 8 operations, each with one operand of 6 characters. The next assignment was to modify the assembler to add a couple of operations, some of which took two operands, and the length of operands increased to 8 characters. Those who hard coded the original assignment had a lot of work to do (I had hard coded some things, but not all). And yes, the third assignment was more of the same.

I hated that guy (as did most of my classmates), because while the lesson was legitimate, laboratory exercises are not applications that will grow in the real world. As first year students, most of us did not have enough experience to develop programs with even that level of sophistication, and he did not recommend that we incorporate it in the assignment. And with other classes to deal with, getting a single program to work at all took time that was in limited supply. His excuse was that, in the real world, we would constantly be faced with changing requirements at a moment's notice, and thus he was doing us a favor.

Having been out here for over 20 years, I have yet to run into anything remotely like what he described, although I do make it a point to be anal about getting adequate requirements descriptions up front. The people I have worked for wanted me to succeed, because it reflected on them. Some were more helpful than others, but I have never worked on a task where the requirements changed wildly, making it impossible to complete. Maybe I'm just lucky.

Incidentally, the way I tested reassembly was to use the latest version of netcat, which supports IPv6 sessions. I created a file that had over 8K of test data, and then sent it over a UDP session, which tried to send the entire file in a single datagram, and forced the TCP/IP stack to fragment it:
    server> nc -6 -u -l -p 2000 fec0:0:0:1::100

    client> nc -6 -u fec0:0:0:1::200 2000 < test.data

Anyhow, I am now working on analyzing the IPv6 extension headers and expect that to be done within a week. After some tidying up, I will be building a new package for download with IPv6 and several other new features. So, back to work.

Later . . . Jim

Monday, June 23, 2008

Loose Threads

Realeyes is a somewhat complex application, both in terms of the number of components that interact with each other (4), and the complexity of those components, in particular the analysis engine/IDS, but also the database and database interface. The analysis engine and IDS are written in C, while the database interface and user interface are written in Java.

When I was planning the design of the analysis engine, I knew from the start that there would be multiple processes running. That left me with the decision of whether to use threads or interprocess communication. I know from painful experience (maintaining a TCP/IP stack written in System/390 assembler) that writing thread-safe code is hard. Therefore I chose to use interprocess communication. I actually had several reasons for choosing this over threads:

  • Writing thread-safe code is really hard.

  • Threads share the same address space and, while all analysis engine processes share some memory, some of them also use significant amounts of unshared memory. I was concerned that this might lead to the application running out of virtual memory.

  • For security reasons, the external interface runs under an ID that has lowered access and in a chroot jail. This means that interprocess communication would have to be used for at least this function.

  • The pcap library for capturing network traffic from the interface was going to be used, and I was pretty sure it could not be used in a threaded process.

  • I wanted to be able to control the priority of processes dynamically, and while the pthread_setschedparam man page says, "See sched_setpolicy(2) for more information on scheduling policies," there is no man page for sched_setpolicy (I have searched the web for it).

  • Writing thread-safe code is really hard.


Long after going through this thought process, I discovered the paper "The Problem with Threads" by Dr. Edward A. Lee at UC Berkeley, which supports my reasoning. After performing a formal analysis to show that writing thread-safe code is really hard, Dr. Lee recommends that code be written single-threaded and then elements of threading (or interprocess communication) be added only as needed. Thank you, Dr. Lee.

This left me with the decision of which IPC techniques to use. There are essentially three:

  • Pipes

  • Message queues

  • Shared memory


I read an article about a test that compared the three (which I cannot find now) and shared memory won hands down (an order of magnitude faster, as I recall). Therefore, while pipes are used in the analysis engine to transfer small pieces of data or low priority information, shared memory is the primary mechanism.

Of course, shared memory is the most difficult to program because it requires a way of guaranteeing that the data stored in every memory location is correct at all times. This is handled in the analysis engine by all of the following methods:

  • Assigning memory locations to a single process that others cannot access

  • Using locks (or semaphores in glibc-speak) to serialize access, which means the operating system allows only the process holding the lock to access the locked memory location

  • Using a mechanism similar to locks (but without the overhead) to serialize access


The centerpiece of this is the memory manager. When the application starts, a single large block is allocated and made non-swappable. This means that the application never has to wait for a block to be swapped in from disk, unlike memory allocated by a process for its own use. This block is chopped up into pools, which are in turn chopped up into buffers. (Note: This is an oversimplification; see the analysis engine slide show on the Realeyes technology page for more detail.)

The memory manager sets an "in use" flag to indicate that a buffer is being used, and clears it when the buffer is released. Each level of the analysis engine uses specific structures, and when the "in use" flag is set for a buffer, other processes are not allowed to access it unless the structure is explicitly passed to them. This is the way the first access method is implemented.

The second access method is actually used by the memory manager to obtain or release a buffer. But it is also used by processes to modify structures in memory that could be potentially modified by two processes simultaneously. Most books on programming with semaphores usually start by saying that POSIX semaphores are overly complicated. I don't disagree, but after a little experimentation, I simply wrote a set of functions to initialize, get, free, and release a single lock. As it turned out, my first attempt did not work well across all platforms where the application was tested. But the correction was basically confined to the functions, with only the addition of an index parameter to one of them that meant changing about a dozen calls in the analysis engine code.

The third access method is very much like message queues, but with the performance of shared memory. When a process has information (in a structure) to pass to another, it puts the structure on a queue that only it may add to and only one other process may remove from. The rule governing most of these queues is that the first structure in the queue may only be removed if there is another one following it. In programming terms, there must be a non-NULL next pointer. So the first process modifies the structure to be added, and the very last step is to set the pointer of the last item in the queue to the new structure's address.

Special handling is necessary for some queues. For example, if there is very little activity, a single structure could be on a queue by itself for a long time (in computer cycles). This is handled in some queues by adding a dummy structure after the one to be processed after a brief wait (maybe a hundredth of a second).

A side effect of the choice of processes over threads is that it is much easier to monitor a process than a thread. It is also quite a bit more straightforward to use a debugger on a process. So, all things considered, I recommend this over threads unless there are strong reasons against it.

Finally, I have to say that the Java code does use threads. However, they are treated like separate processes in that they don't share memory. All data is passed in arguments to the methods being called or in the method return value. This eliminates the most problematic aspects of making code thread-safe, but (I have discovered) not all of them. The other issues are memory-related, but it is memory that the application does not control, such as the window in which the application is displayed, or network connections.

Overall, I agree with Dr. Lee when he says that threads "are wildly nondeterministic. The job of the programmer is to prune away that nondeterminism." And I don't find it to be too much of a stretch when he continues that, "a folk definition of insanity is to do the same thing over and over again and to expect the results to be different. By this definition, we in fact require that programmers of multithreaded systems be insane. Were they sane, they could not understand their programs."

Later . . . Jim

Wednesday, June 11, 2008

Elitism Improves Productivity

The Realeyes IDS application includes multiple plugins that interact with each other. The basic means of communication is a structure with information about the status of a network session, put on a queue by one plugin and taken off by the next one to process the session.

At the lowest level, this is a Data structure, which defines the packet captured by the Collector. The Data structure is then taken by the Stream Handler which determines which session it belongs to and sets some information, such as the start time, and then puts a Stream Analysis Work Element (SAWE) on another queue. The Stream Analyzers perform matching operations on the packets based on the rules defined for each one. Then the Action Analyzer and Event Analyzer perform correlation on the results of the Stream Analyzers.

This works very smoothly, except for the fact that there are multiple Stream Analyzers and one Action Analyzer. The Action Analyzer can free Data structures, and it must not free any that are still being processed. Because all of this analysis is happening asynchronously, the fields that indicate the state can change while being tested.

To handle this, I created a separate field that is set once when the session is ready for the Action Analyzer. Initially, I tried to wait briefly for the Stream Analyzers to update these fields. Of course, briefly is in the eye of the beholder. I set the wait value to 1 microsecond, which is 0.000001 second.

But the standard clock in most Intel computers is actually ticking once per 0.1 millisecond, or 0.0001 second. This is like saying, "Give me a second," and then taking over a minute and a half. The result was that work piled up waiting on the Action Analyzer. Buffers could not be freed and the application could not run for more than a couple of hours in the pilot environment.

I finally realized that instead of waiting for the first SAWE on the queue, the Action Analyzer should try to find one that was ready. In other words, it should ignore the structures that didn't meet its standards, and only choose that of the highest quality. In still other words, it should be an elitist.

And lo and behold, buffer usage became almost a non-issue. The application now runs for days without running out of buffers. (In fact, it usually crashes from a bug before it runs out of buffers, but I'm working on fixing those.)

This demonstrates that being described as an elitist can be a compliment.

Later . . . Jim

Friday, May 30, 2008

Code: Library and Plugins

I just realized that the name of the blog has technology in it, and I have hardly mentioned code. The Realeyes project was originally started as a network Intrusion Detection System project. I have worked on several systems in which an attempt was made to design them modularly, but gradually, functions that were supposed to be generic incorporated application specific data and code. This increases the chance of creating errors when such a function is called by many other functions.

So I decided to create a library and then build the application on it. The library is called the Realeyes Analysis Engine. Applications are built on the library by creating plugin programs that call library functions. The first application had nothing to do with networks--it simply took a series of random numbers, organized them by their high-order digits, and analyzed the low-order digits for patterns.

When I started writing the network IDS code, I found that I needed more control over some of the library functions, so I added hooks for the application. For any of you who have read about these so-called hooks but aren't sure what they are, they also go by the name of 'callbacks'. And what that means is that the library function calls an external function with predefined parameters. The name of the function may be specified or a pointer to the function may be initialized, and I use both.

For example, the library's main function calls three functions that every plugin must include, even if all they do is immediately return:
  • local_plugin_init(): This allows anything that needs to be done before the parser runs to be handled

  • plugin_parser(xml_main_structure, xml_dom_tree): The XML file gets parsed for syntax by the analysis engine, which then passes a Document Object Model (DOM) tree to the plugin, which parses the values

  • plugin_process(): This is where the plugin does its main job

The plugin parser is particularly interesting. The analysis engine uses libxml2 to parse an XML configuration file and build a tree of the values in the configuration file. (And yes, I have used the Expat library, which implements the Simple API for XML (SAX). But SAX is simple only for implementing the library, not the application, and I would only use it if I were under serious memory constraints--which is not the case for most configuration files.) The DOM tree is read by the plugin parser. But the code to read the tree is a bit hard on the eyes, not to mention being a typo magnet, as this simplistic sample of getting a data value shows:

if (raeXML_NODE->xmlChildrenNode != NULL) {
    value = xmlStrdup(XML_GET_CONTENT(raeXML_NODE->xmlChildrenNode));
}

So, the analysis engine library includes several macros that make writing the plugin parser look a little like a Basic program. It also includes macros to name the function and parameters. Et voila, writing a plugin parser is as easy as this:

int raePLUG_PARSER(raeXML_PARM)
{
    GET_NEXT_ELEMENT;
    IF_ELEMENT("Element_name")
    {
        WHILE_ATTR_LIST
        {
            IF_ATTR("Attribute")
            {
                GET_ATTR(attr_value);
                /* process attr_value ... */
            }
        }
        GET_DATA(data_value);
        /* process data_value ... */
    }
    GET_NEXT_NODE(status);
}
If you are thinking this should be contributed back to the libxml2 project, I don't think it would work. The Realeyes project only uses XML for configuration files with very simple syntax. Meanwhile, libxml2 handles the full range of XML capabilities. However, if you know someone who is working on an application and is annoyed (or annoying) about having to parse XML configuration files, point them to the Realeyes Analysis Engine subversion repository, where they can look at the XML parser source and include files.

The other type of hook/callback uses a function pointer. The reason for this is to make it optional. If the pointer is not initialized, then the callback function is not called. An example of this is the special handler for after an Analysis Record is built:
    This pointer must be initialized to a point to a function:
      raeRecordHandler raeEventRecordHandler

    The function may have any name, but must accept the specified parameter:
      erh_function (raeAnalysisRecord *rh_record)

The analysis engine library does all of the heavy lifting. Once the parsing is complete, plugins do not allocate any memory, unless there are specialized functions coded (I am particularly happy with the memory management, but that is a discussion for another post). There are library functions for managing multiple streams of data, matching values at specific locations in headers or strings in data, and building records for information that has been matched with a rule, just to name a few.

This makes the plugins fairly lightweight--the largest, the Action Analyzer, is just over 1,000 lines of code, most of which is parsing the options for collecting statistics. In fact, the statistics collection code, in a separate source file, is more than twice as large at over 2,200 lines of code, which gives a sense of how little the plugins have to do.

I gave a presentation to my local Linux User Group, and afterward one of the attendees talked to me about using it for some mathematical analysis he is involved in. I don't know if it will work for him, but I would be very happy if the library is found to be useful for other projects. The library is capable of handling multiple TCP sessions (35,000 simultaneously is the current peak), which are about as random as streams of data get, so it will certainly handle streams that are controlled. The output is created by a relatively simple plugin, which means it can be customized as much as necessary.

Later . . . Jim