Monday, October 22, 2007

Kernel Debugging Techniques

Debugging the Linux kernel is an important topic. What is the best way to debug the kernel when you face a race condition, a crash, or a freeze that you have no control over? In this post I'll try to explore the various ways of debugging the kernel. Let me first list the existing techniques and go through the details of each (let me know if there are others I'm not aware of), followed by an explanation of which one I consider the best for debugging (a personal view).

- Logging or debug traces through printk messages, read with dmesg or from /var/log/messages
- Stepping through the kernel code using the gdb debugger
- Using a serial port to debug from another machine
- Reading the kernel source code
- Sending a SysRq key sequence when the system is frozen
- Analyzing an oops message and dumping the object file

In my view, the best way to debug a kernel is to add trace logs. The traces tell you the last point the code reached before the crash occurred.

You can use dmesg to print the kernel ring buffer, or simply tail /var/log/messages, which is where most kernel printk messages go. Similar to printf, printk prints out a kernel message. It is best to enable these traces only in a debug build, as you don't want traces printed all over the place in normal operation.
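As a rough illustration of reading traces back, here is a small Python sketch that picks out the last trace line carrying a given tag from log text such as the output of dmesg. The tag `mydrv:` and the sample lines are made up for the example; they are assumptions, not output from a real driver.

```python
def last_trace(log_text, tag):
    """Return the last line of log_text that contains tag, or None if there is none."""
    last = None
    for line in log_text.splitlines():
        if tag in line:
            last = line.strip()
    return last


# Hypothetical log excerpt; a real one would come from dmesg or /var/log/messages.
SAMPLE_LOG = """\
mydrv: entering open
mydrv: allocating buffer
usb 1-1: new full speed USB device
mydrv: starting transfer
"""

if __name__ == "__main__":
    # The last tagged line is the last point the (hypothetical) driver reached.
    print(last_trace(SAMPLE_LOG, "mydrv:"))
```

If the machine crashed right after this excerpt, the last tagged line narrows the search to whatever code runs after that trace.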

Using gdb is harder for kernel debugging. Has anyone tried it? You also need to build a debug version of the kernel, if I'm not mistaken.

As for using a serial port, I have never tried it, but it is worth trying.

Sometimes you may get a BUG from the kernel source because your code passed in some wrong values. It's like an ASSERT in the kernel. This is easy to solve, as you can directly read the source code where the BUG occurs. In many cases, though, reading the kernel source can be dangerous, as some may be too tempted to patch it so it won't BUG anymore. In my view, the kernel source is there for you to understand the behavior, not to patch. Until you understand it, don't patch anything...

Another way of debugging is to analyze an oops message. Linux generates an oops message when the kernel hits a fault such as an invalid memory access (the kernel's counterpart of a segmentation fault). You can also try to force the kernel to dump its state when the machine no longer responds; in that case the kernel may be in a deadlock. You can configure the kernel to respond to the magic SysRq key (Alt+SysRq plus a command key).

What is an oops message? It is similar to a kernel dump file in Windows, though with less information. Basically, it gives you the last EIP address, which you can use to trace back the last call before the crash.
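To make the EIP idea concrete, here is a minimal Python sketch that pulls the faulting symbol and offset out of oops text. It assumes the i386 "EIP is at symbol+0xoffset/0xsize" line; the sample oops below is a simplified, hypothetical excerpt (the module `mydrv` and function `my_write` are invented), so treat it as an illustration rather than a complete parser.

```python
import re

# Matches the "EIP is at symbol+0xoffset/0xsize" line found in i386 oops messages.
EIP_LINE = re.compile(
    r"EIP is at (?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def find_eip(oops_text):
    """Return (symbol, offset, size) from the EIP line, or None if it is absent."""
    match = EIP_LINE.search(oops_text)
    if match is None:
        return None
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
    )


# Simplified, hypothetical oops excerpt used only for this example.
SAMPLE_OOPS = """\
Unable to handle kernel NULL pointer dereference at virtual address 00000000
EIP:    0060:[<e0a1201a>]    Not tainted VLI
EIP is at my_write+0x1a/0x50 [mydrv]
"""

if __name__ == "__main__":
    print(find_eip(SAMPLE_OOPS))
```

The symbol names the function that was executing, and the offset says how many bytes into that function the faulting instruction sits.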

If your oops message is not detailed enough, you can use the ksymoops command on Linux to generate more detailed information. With an oops message in hand, you can also dump the object file to figure out the last line of code executed before the crash. A nice command for this on Linux is objdump.
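One way to connect the two, sketched in Python: given disassembly text in the style printed by `objdump -d` and a symbol plus offset taken from an oops, find the instruction at that position. The disassembly below is an abridged, hypothetical sample (the names `my_open` and `my_write` are invented), and the parsing assumes the usual "address <symbol>:" header and "address: bytes instruction" line layout.

```python
import re

# "00000010 <my_write>:" marks the start of a function in the disassembly.
HEADER = re.compile(r"^([0-9a-f]+) <([^>]+)>:\s*$")
# "  2a:   8b 00   mov (%eax),%eax" is a single disassembled instruction.
INSN = re.compile(r"^\s*([0-9a-f]+):\s+(.*)$")


def instruction_at(disassembly, symbol, offset):
    """Return the instruction text located offset bytes into symbol, or None."""
    start = None
    for line in disassembly.splitlines():
        header = HEADER.match(line)
        if header:
            # Track the start address only while we are inside the wanted symbol.
            start = int(header.group(1), 16) if header.group(2) == symbol else None
            continue
        if start is None:
            continue
        insn = INSN.match(line)
        if insn and int(insn.group(1), 16) == start + offset:
            return insn.group(2).strip()
    return None


# Abridged, hypothetical disassembly used only for this example.
SAMPLE_DISASSEMBLY = """\
00000000 <my_open>:
   0:   55                      push   %ebp
   1:   c3                      ret

00000010 <my_write>:
  10:   55                      push   %ebp
  11:   89 e5                   mov    %esp,%ebp
  2a:   8b 00                   mov    (%eax),%eax
  2c:   c3                      ret
"""

if __name__ == "__main__":
    # 0x1a is the offset reported for my_write in the hypothetical oops.
    print(instruction_at(SAMPLE_DISASSEMBLY, "my_write", 0x1A))
```

In this sample the instruction found is a load through %eax, which is the kind of instruction you would expect behind a NULL pointer dereference.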

That's enough for today; more on this later...
