[FreeBSD] Kernel Tracing Facilities, DTrace and KTR

2024. 7. 16. 16:35ComputerScience/FreeBSD

 

 

 

3.8 Kernel Tracing Facilities

Operating-system kernels are large and complex pieces of software, encompassing thousands of lines of mainly C code, organized into dozens of subsystems and including hundreds of device drivers. Understanding what is happening within the operating system while it is running is important not only to developers of the code but also to the much larger group of people who use the system every day to get their work done. FreeBSD includes several facilities that allows users and administrators of the system to understand what is happening inside the system as it executes.

 

 

 

System-Call Tracing

The ktrace facility allows a user to get a detailed trace of the order of, arguments to, and results from all the system calls done by an application. This information includes such important details as pathnames being looked up, type and timing of signals being posted, and even the contents of all input and output operations.

 

The facility is available for any application without the need for prior compilation or inclusion of special hooks. Thus, it can be particularly helpful when trying to debug an application for which the source code is unavailable. Tracing can be flexibly applied. It can be started at the time an application begins running or to an already running application. It can be applied to individual processes, to all the processes in a process group, or through inheritance to all current or future children processes. (다양한 대상을 바탕으로 tracing 가능)

 

The traces are generated in a compact binary format to keep them as dense as possible. The use of a binary format also minimizes the time spent collecting and writing the information while the application runs and avoids the need to do string processing in the kernel. The binary dump is converted to a human-readable format using the kdump program that converts the system call numbers to names, ioctl values to their macro names, system error numbers to their standard error strings, and displays the running time between and during system calls.

 

A similar facility is provided by the truss command popularized in System V and Solaris. Rather than having specialized hooks inside the kernel to collect tracing information, it collects its information by stopping and restarting the processes being monitored using the ptrace system call. As a result, it has higher overhead and provides less information than ktrace.

 

 

 

DTrace

The information collected by ktrace is limited to the set of information available from its fixed set of hooks in the kernel. Further, this set of hooks has to be limited to those pieces of information deemed to(고려되다) be most generally useful. (일반적으로 많이 사용될 정보들) If ktrace collected every possibly useful bit of information, it would generate an overwhelming amount of data for even trivial applications.

Its other major limitation is that it collects information only about system calls. When trying to track down bugs or performance problems, one needs to analyze the entire software stack that includes the application itself, the libraries that it uses, and the system calls that it makes. The DTrace facility was developed to address all these issues [Cantrill et al., 2004]. Originally written for Sun’s Solaris operating system, DTrace support was added to FreeBSD 8, and is present by default in kernels starting with FreeBSD 10.

 

DTrace greatly expands the set of information from system calls by adding thousands of hooks, referred to as tracepoints, that can identify many details of what is happening. To avoid the resulting avalanche of data (갑자기 덮쳐오는 것), each tracepoint can be configured to conditionally collect and output its information.

DTrace defines the D language that allows application developers and systems administrators to write a small D-language program to describe the information that they want to collect. They can specify the tracepoint that they are interested in inspecting and refine the information that they output. For example, a D program might monitor a routine that changes a reference counter to collect the highest value that it ever reached, or collect the total number of times that a resource was referenced rather than just blindly outputting some information every time the routine is called.

Only those tracepoints that are useful to the analysis are activated, while all others are left dormant.  With its ability to trigger only a small number of tracepoints at anyone one time, DTrace can collect detailed information on a narrow set of interesting events with low overhead and carefully bounded output.

 

The tracing information is expanded into the rest of the software stack by adding DTrace tracepoints to the system libraries and to the application itself. A standard set of tracepoints is available with no programming effort required by the library or application developers. The standard tracepoints are the set of all functions in a library, application, or the kernel itself, and include the ability to capture parameter information at every subroutine call and return. In addition to these function boundary tracepoints, many system libraries have had additional application-specific hooks added by their developers.

 

The tracing functionality in DTrace is implemented using probes and providers. A probe is a specific tracepoint within the running kernel, such as a function boundary, while a provider is a kernel module that supports a set of probes. The DTrace system was designed to be extended so that new kernel modules or services could have tracepoints that were not envisioned (구상된) by the original authors. The providers found in FreeBSD are listed in Table 3.2. The mac_framework, sched, and vfs providers are unique to FreeBSD.

 

 

 

FreeBSD DTrace providers

 

 

 

Before the advent of DTrace, any system to which logging or tracing was added demonstrated a significant probe effect. Consider what happens when a function contains a conditional statement that determines whether to report a statistic. Whether or not the statistic is reported, the effect of having the conditional statement in the function, compared to no conditional statement being present at all, is measurable. The overhead seen by having the conditional statement is called the probe effect. (조건문으로 인한 오버헤드를 probe effect라고 한다) 

 

DTrace implements its probes by patching the executable when a probe is activated. For example, to monitor a call to a routine, the first instruction of the called routine is replaced with a call to the probe. The probe then collects its information, executes the instruction its call replaced, and returns to the instruction following the one that it patched. When the probe is deactivated, the original patched-instruction location with the call to the probe is replaced with the instruction that was originally at that location. This technique avoids any probe effect for inactive probes.

-> pinatrace 기법과 유사한 것 같다.!

https://software.intel.com/sites/landingpage/pintool/docs/98484/Pin/html/

 

Pin: Pin 3.21 User Guide

 

software.intel.com

 

 

 

Debuggers, such as lldb, use the same technique to add breakpoints to the program they are debugging. (Apple에서 기본으로 제공했던 디버깅 툴!) Patching an instruction stream that is currently executing is both tricky to do safely, and an explicit violation of the security and privacy of data structures on the system; thus, super-user privileges are required to enable kernel probes. DTrace is safe to use in production systems as it ensures that its patches to the program or kernel cannot cause either to behave incorrectly. An example of a precaution taken by DTrace to ensure safe operation is that it will not instrument a jump table, an operation that could easily lead to a system failure. 

 

As part of the kernel build process, a separate program, ctfconvert, is executed against all the kernel’s object files to generate updated object files that can be understood by DTrace. The ctfconvert program takes information from the debugging section of an object file and creates a new section .SUNW_ctf that contains type information for each function’s arguments that can be used by DTrace. Every type that appears in the .SUNW_ctf section of the object file is converted into data that can be used by the function boundary trace provider to associate data types with function arguments, allowing userspace D scripts to interrogate function arguments in much the same way as the debugger allows a programmer to inspect program data with associated type information.

 

All functions found in the debugging section are exposed to the user as individual trace-points. The FreeBSD kernel contains over 45,000 function boundary tracepoints, or fbt probes, each of which can be triggered on entry to or exit from a routine in the kernel. Because the generation of the function boundary tracing probes happens automatically at system build time, new probes come into existence whenever new code is added to the FreeBSD kernel, relieving developers of the need to add tracepoints to the system by hand, and to keep the tracepoints up to date with other code changes.

Function boundary tracepoints can change with each release of the operating system, which means that function boundary tracepoints may change or disappear, so cannot be depended upon in scripts when upgrading across major releases, for example, from FreeBSD 9 to FreeBSD 10.

Compiler optimization or the redefinition of a function as static can also cause a function boundary tracepoint to disappear. Statically defined tracepoints (SDT) do not change unless they are specifically updated by a programmer and are therefore considered stable across major releases. 

 

 

 

Tracepoints are reached through DTrace providers, each of which is a kernel module that exposes a uniform interface to the rest of the kernel. All DTrace providers expose a uniform API via a set of function pointers embedded in a DTrace-provider operations structure, dtrace_pops

 

Each provider registers itself with the DTrace system by calling the dtrace_register() routine that allows DTrace to track all the available providers and expose them to the user through the D programming language. The dtrace_register() routine passes the dtrace_pops structure for the provider as one of its arguments. Provider operations include enabling, disabling, suspending, and resuming probes, as well as retrieving argument names and values, from within the probe.

 

DTrace understands not only functions but also basic types, such as integers, strings, and programmer-defined structures. When the fbt module is loaded, tracepoints are created for the kernel using the kernel’s linker functions. The fbt_provide_module_function() routine is responsible for disassembling the entry and exit points of each function in the kernel as well as all the loaded modules, building up a list of fbt_probe_t structures that contain the address of the functions that can be probed.

 

The fbt_probe_t structure contains three key components used by DTrace when turning tracing on or off.

The fbtp_patchpoint is the address of the instruction that needs to be replaced when tracing is enabled or disabled. When the fbt_provide_module_function() runs, it determines the address of the instruction that must later be replaced with a function call to the DTrace system during active tracing. The address of that instruction is stored in fbtp_patchpoint.

At the same time, the instruction that must be replaced during tracing is put into the fbtp_savedval element of the structure and the instruction that will be used to cause the entry point to change is placed into fbtp_patchval. Whenever the DTrace command enables a tracepoint, the fbtp_patchpoint is set to the instruction stored in fbtp_patchval.

When tracing is turned off, the instruction stored in fbtp_savedval is again placed back into fbtp_patchpoint. Storing the instructions during module load time makes the fbt_enable() and fbt_disable() routines shorter and safer than they would be if the functions to be traced had to be disassembled whenever the user enabled a trace-point.

 

While the automatic generation of tracepoints for functions is a powerful feature of DTrace, the ability to add specific tracepoints not associated with a function boundary is an important part of the system. Any subsystem or collection of subsystems can be encapsulated as a provider to be monitored by developers and administrators. The collection of locking statistics for individual threads is an example of a DTrace provider that was written by hand into the kernel’s source code. The DTrace lockstat provider piggybacks on the preexisting lockstat statistics collection by adding a probe to each of the lockstat macros.

 

The FreeBSD kernel provides several synchronization primitives, referred to collectively as locks, that are described fully in Section 4.3. To track the locking and unlocking actions of the kernel at a low level, it is necessary to know more than just the lock that is being requested. We often want to know if it was necessary to wait for the lock, and if so, which other thread blocked us from getting it. The information we want from a locking operation is not evident from the parameters to the lock call, but is embedded within the code implementing the lock.

Locking statistics are collected by manually placing macros at key points in the locking-implementation code to record data about the lock when it is acquired or released. These locking statistics reside in the lockstat provider. The macros for collecting the lockstat statistics are written generically enough that they can be used in all the locking primitives in the system.

When a lock is acquired, the macro is given a pointer to the lock object, a Boolean flag telling whether the lock was contested, the time that the thread started waiting for the the lock, and the kernel file name and the line number within the function from which the lock primitive was called. The macros call an appropriate function to collect statistics on how often the lock is acquired, average time that it is held, how often it is contested, and when it is contested, as well as the average time that the blocked thread had to wait. These statistics identify the contested locks in the kernel. One important way of improving system performance is to use finer-grain locking for the most highly contested locks in the system. For example, a single global lock controlling access to a hash table might be replaced with one lock per hash chain.

 

The lockstat DTrace provider is only available if the lockstat lock profiling has been compiled into the kernel. Each probe is defined by a lockstat_probe structure that contains the function and name of the probe, which is exported to the user as well as a probe number and probe identifier that the lockstat provider uses when the probes fire.

 

The macro shown in Figure 3.4 is placed into various locking functions to record data whenever a lock is acquired. The lockstat information is collected by the lock_profile_obtain_lock_success() function. The remainder of the macro implements the lockstat provider probe. The probe is active if its entry in the lockstat_probemap is non-NULL, which occurs only when a probe is activated via the lockstat provider’s lockstat_enable() routine. 

Configuring the kernel to collect lockstat statistics introduces a fixed amount of overhead. The lockstat DTrace provider introduces a variable amount of overhead depending on which probes are active at any time.

 

 

 

Lockstat probe macro

 

 

 

Kernel Tracing

Any large software system usually includes a logging system to aid in debugging problems that arise after the software has been released and FreeBSD is no different. The kernel tracing facility (KTR) is a set of logging facilities that can be compiled into the kernel with a configuration option. This tracing facility primarily helps debug the kernel as compared to the other tracing described in this section that primarily helps debug user-level processes.

 

Before the addition of KTR, developers would sprinkle calls to printf() throughout their code, and conditionally compile them in or out of the kernel using

 

#ifdef DEBUG

printf()

#endif DEBUG

 

// 이러한 형식의 프로그래밍, 어제 코드 리뷰하면서 찾았던 부분.! 

// https://sohyeonkim-dev.tistory.com/329 

 

[PIM] CPU/DPU Programming Code Review

PrIM BenchmarksVertor Addition Code Reivew https://github.com/SohyeonKim-dev/prim-benchmarks GitHub - SohyeonKim-dev/prim-benchmarks: PrIM (Processing-In-Memory benchmarks) is the first benchmark suite for a real-world prPrIM (Processing-In-Memory benchm

sohyeonkim-dev.tistory.com

 

 

statements. The KTR system introduced a single logging system for the kernel that could be shared by the entire source base and centrally controlled from kernel configuration files.

 

Kernel trace events are described by a ktr_entry structure, shown in Figure 3.5. Each entry contains a timestamp, the CPU on which the event occurred, the file and line of source code from which the event was logged, a programmer specified description, a pointer to the thread that executed the event, and up to six parameters.

 

 

 

Kernel trace-entry structure

 

 

 

Calls into the kernel tracing system are implemented as a set of macros, shown in Figure 3.6. Unlike the printf() routine, the kernel tracing facility does not allow for a variable number of arguments. Variable argument parsing is computationally expensive, and is not appropriate for a kernel logging facility because the extra CPU time expended might prevent developers from catching timing-related problems.

 

When support for the KTR system is compiled into the kernel, there is a statically sized array, used as a circular buffer, that contains all the trace events. It is important to size the event buffer correctly as it is possible to overwrite entries before they are read out of the kernel if many events are logged in a short amount of time.

 

If enabling KTR recorded all the more than the 1700 defined events, the system would grind to a near halt. (멈춤) Thus, the rate of event generation is controlled by tagging each event with a programmer-defined event mask. The event mask collects a related set of events so that recording them can be turned on or off as a group.

When a kernel with KTR is first booted, the system’s event mask is cleared so that no events are recorded. The event mask is set either from user level, using the debug.ktr.mask sysctl, or in the boot configuration file /boot/loader.conf, if events are to be recorded from the time the system boots.

 

Kernel tracing events can be recorded to disk using the asynchronous logging facility which runs as a kernel resident thread. It reads events from the kernel trace buffer and writes them to a file specified by the user using a sysctl. Writing events to a file, whether on local disk or a remote filesystem, introduces extra I/O load onto the system, which may make it inappropriate for finding timing-related problems in the system.

 

 

 

Kernel tracing macros