Demonstrations of stackcount, the Linux eBPF/bcc version. This program traces functions and frequency counts them with their entire stack trace, summarized in-kernel for efficiency. For example, counting stack traces that led to the submit_bio() kernel function, which creates block device I/O: # ./stackcount submit_bio Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end. ^C submit_bio submit_bh journal_submit_commit_record.isra.13 jbd2_journal_commit_transaction kjournald2 kthread ret_from_fork mb_cache_list 1 submit_bio __block_write_full_page.constprop.39 block_write_full_page blkdev_writepage __writepage write_cache_pages generic_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb 2 submit_bio __block_write_full_page.constprop.39 block_write_full_page blkdev_writepage __writepage write_cache_pages generic_writepages do_writepages __filemap_fdatawrite_range filemap_fdatawrite fdatawrite_one_bdev 36 submit_bio submit_bh jbd2_journal_commit_transaction kjournald2 kthread ret_from_fork mb_cache_list 38 submit_bio ext4_writepages do_writepages __filemap_fdatawrite_range filemap_flush ext4_alloc_da_blocks ext4_rename ext4_rename2 vfs_rename sys_rename entry_SYSCALL_64_fastpath 79 Detaching... The output shows unique stack traces, in order from leaf (on-CPU) to root, followed by their occurrence count. The last stack trace in the above output shows syscall handling, ext4_rename(), and filemap_flush(): looks like an application issued file rename has caused back end disk I/O due to ext4 block allocation and a filemap_flush(). Now adding the -P option to display stacks separately for each process: # ./stackcount -P submit_bio Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end. ^C submit_bio ext4_writepages do_writepages __filemap_fdatawrite_range filemap_flush ext4_alloc_da_blocks ext4_release_file __fput ____fput task_work_run exit_to_usermode_loop syscall_return_slowpath entry_SYSCALL_64_fastpath [unknown] [unknown] tar [15069] 5 submit_bio ext4_bio_write_page mpage_submit_page mpage_map_and_submit_buffers ext4_writepages do_writepages __filemap_fdatawrite_range filemap_flush ext4_alloc_da_blocks ext4_release_file __fput ____fput task_work_run exit_to_usermode_loop syscall_return_slowpath entry_SYSCALL_64_fastpath [unknown] [unknown] tar [15069] 15 submit_bio ext4_readpages __do_page_cache_readahead ondemand_readahead page_cache_async_readahead generic_file_read_iter __vfs_read vfs_read sys_read entry_SYSCALL_64_fastpath [unknown] tar [15069] 113 Detaching... The last stack trace in the above output shows syscall handling, sys_read(), vfs_read(), and then "readahead" functions: looks like an application issued file read has triggered read ahead. With "-P", the application can be seen after the stack trace, in this case, "tar [15069]" for the "tar" command, PID 15069. The order of printed stack traces is from least to most frequent. The most frequent in this case, the ext4_readpages() stack, was taken 113 times during tracing. The "[unknown]" frames are from user-level, since this simple workload is the tar command, which apparently has been compiled without frame pointers. It's a common compiler optimization, but it breaks frame pointer-based stack walkers. Similar broken stacks will be seen by other profilers and debuggers that use frame pointers. Hopefully your application preserves them so that the user-level stack trace is visible. So how does one get frame pointers, if your application doesn't have them to start with? For the current bcc (until it supports other stack walkers), you need to be running an application binaries that preserves frame pointers, eg, using gcc's -fno-omit-frame-pointer. That's about all I'll say here: this is a big topic that is not bcc/BPF specific. It can be useful to trace the path to submit_bio to explain unusual rates of disk IOPS. These could have in-kernel origins (eg, background scrub). Now adding the -d option to delimit kernel and user stacks: # ./stackcount -P -d submit_bio Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end. ^C submit_bio submit_bh journal_submit_commit_record jbd2_journal_commit_transaction kjournald2 kthread ret_from_fork -- jbd2/xvda1-8 [405] 1 submit_bio submit_bh jbd2_journal_commit_transaction kjournald2 kthread ret_from_fork -- jbd2/xvda1-8 [405] 2 submit_bio ext4_writepages do_writepages __filemap_fdatawrite_range filemap_flush ext4_alloc_da_blocks ext4_release_file __fput ____fput task_work_run exit_to_usermode_loop syscall_return_slowpath entry_SYSCALL_64_fastpath -- [unknown] [unknown] tar [15187] 5 submit_bio ext4_bio_write_page mpage_submit_page mpage_map_and_submit_buffers ext4_writepages do_writepages __filemap_fdatawrite_range filemap_flush ext4_alloc_da_blocks ext4_release_file __fput ____fput task_work_run exit_to_usermode_loop syscall_return_slowpath entry_SYSCALL_64_fastpath -- [unknown] [unknown] tar [15187] 15 submit_bio ext4_readpages __do_page_cache_readahead ondemand_readahead page_cache_async_readahead generic_file_read_iter __vfs_read vfs_read sys_read entry_SYSCALL_64_fastpath -- [unknown] [unknown] [unknown] tar [15187] 171 Detaching... A "--" is printed between the kernel and user stacks. As a different example, here is the kernel function hrtimer_init_sleeper(): # ./stackcount.py -P -d hrtimer_init_sleeper Tracing 1 functions for "hrtimer_init_sleeper"... Hit Ctrl-C to end. ^C hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- [unknown] containerd [16020] 1 hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- __pthread_cond_timedwait Monitor::IWait(Thread*, long) Monitor::wait(bool, long, bool) CompileQueue::get() CompileBroker::compiler_thread_loop() JavaThread::thread_main_inner() JavaThread::run() java_start(Thread*) start_thread java [4996] 1 hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- [unknown] [unknown] containerd [16020] 1 hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- __pthread_cond_timedwait VMThread::loop() VMThread::run() java_start(Thread*) start_thread java [4996] 3 hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- [unknown] dockerd [16008] 4 hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- [unknown] [unknown] dockerd [16008] 4 hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- __pthread_cond_timedwait Lio/netty/util/ThreadDeathWatcher$Watcher;::run Interpreter Interpreter call_stub JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*) JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*) JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*) thread_entry(JavaThread*, Thread*) JavaThread::thread_main_inner() JavaThread::run() java_start(Thread*) start_thread java [4996] 4 hrtimer_init_sleeper do_futex SyS_futex entry_SYSCALL_64_fastpath -- __pthread_cond_timedwait clock_gettime [unknown] java [4996] 79 Detaching... I was just trying to find a more interesting example. This output includes some Java stacks where user-level has been walked correctly (even includes a JIT symbol translation). dockerd and containerd don't have frame pointers (grumble), but Java does (which is running with -XX:+PreserveFramePointer). Here's another kernel function, ip_output(): # ./stackcount.py -P -d ip_output Tracing 1 functions for "ip_output"... Hit Ctrl-C to end. ^C ip_output ip_queue_xmit tcp_transmit_skb tcp_write_xmit __tcp_push_pending_frames tcp_push tcp_sendmsg inet_sendmsg sock_sendmsg sock_write_iter __vfs_write vfs_write SyS_write entry_SYSCALL_64_fastpath -- __write_nocancel [unknown] sshd [15015] 5 ip_output ip_queue_xmit tcp_transmit_skb tcp_write_xmit __tcp_push_pending_frames tcp_push tcp_sendmsg inet_sendmsg sock_sendmsg sock_write_iter __vfs_write vfs_write SyS_write entry_SYSCALL_64_fastpath -- __write_nocancel [unknown] [unknown] sshd [8234] 5 ip_output ip_queue_xmit tcp_transmit_skb tcp_write_xmit __tcp_push_pending_frames tcp_push tcp_sendmsg inet_sendmsg sock_sendmsg sock_write_iter __vfs_write vfs_write SyS_write entry_SYSCALL_64_fastpath -- __write_nocancel sshd [15015] 7 Detaching... This time just sshd is triggering ip_output() calls. Watch what happens if I filter on kernel stacks only (-K) for ip_output(): # ./stackcount.py -K ip_output Tracing 1 functions for "ip_output"... Hit Ctrl-C to end. ^C ip_output ip_queue_xmit tcp_transmit_skb tcp_write_xmit __tcp_push_pending_frames tcp_push tcp_sendmsg inet_sendmsg sock_sendmsg sock_write_iter __vfs_write vfs_write SyS_write entry_SYSCALL_64_fastpath 13 Detaching... They have grouped together as a single unique stack, since the kernel part was the same. Here is just the user stacks, fetched during the kernel function ip_output(): # ./stackcount.py -P -U ip_output Tracing 1 functions for "ip_output"... Hit Ctrl-C to end. ^C [unknown] snmpd [1645] 1 __write_nocancel [unknown] [unknown] sshd [8234] 3 __write_nocancel sshd [15015] 4 I should really run a custom sshd with frame pointers so we can see its stack trace... User-space functions can also be traced if a library name is provided. For example, to quickly identify code locations that allocate heap memory for PID 4902 (using -p), by tracing malloc from libc ("c:malloc"): # ./stackcount -p 4902 c:malloc Tracing 1 functions for "malloc"... Hit Ctrl-C to end. ^C malloc rbtree_new main [unknown] 12 malloc _rbtree_node_new_internal _rbtree_node_insert rbtree_insert main [unknown] 1189 Detaching... Kernel stacks are absent as this didn't enter kernel code. Note that user-space uses of stackcount can be somewhat more limited because a lot of user-space libraries and binaries are compiled without frame-pointers as discussed earlier (the -fomit-frame-pointer compiler default) or are used without debuginfo. In addition to kernel and user-space functions, kernel tracepoints and USDT tracepoints are also supported. For example, to determine where threads are being created in a particular process, use the pthread_create USDT tracepoint: # ./stackcount -P -p $(pidof parprimes) u:pthread:pthread_create Tracing 1 functions for "u:pthread:pthread_create"... Hit Ctrl-C to end. ^C parprimes [11923] pthread_create@@GLIBC_2.2.5 main __libc_start_main [unknown] 7 You can use "readelf -n file" to see if it has USDT tracepoints. Similarly, to determine where context switching is happening in the kernel, use the sched:sched_switch kernel tracepoint: # ./stackcount -P t:sched:sched_switch __schedule schedule worker_thread kthread ret_from_fork kworker/0:2 [25482] 1 __schedule schedule schedule_hrtimeout_range_clock schedule_hrtimeout_range ep_poll SyS_epoll_wait entry_SYSCALL_64_fastpath epoll_wait Lsun/nio/ch/SelectorImpl;::lockAndDoSelect Lsun/nio/ch/SelectorImpl;::select Lio/netty/channel/nio/NioEventLoop;::select Lio/netty/channel/nio/NioEventLoop;::run Interpreter Interpreter call_stub JavaCalls::call_helper(JavaValue*, methodHandle*, JavaCallArguments*, Thread*) JavaCalls::call_virtual(JavaValue*, KlassHandle, Symbol*, Symbol*, JavaCallArguments*, Thread*) JavaCalls::call_virtual(JavaValue*, Handle, KlassHandle, Symbol*, Symbol*, Thread*) thread_entry(JavaThread*, Thread*) JavaThread::thread_main_inner() JavaThread::run() java_start(Thread*) start_thread java [4996] 1 ... (omitted for brevity) __schedule schedule schedule_preempt_disabled cpu_startup_entry xen_play_dead arch_cpu_idle_dead cpu_startup_entry cpu_bringup_and_idle swapper/1 [0] 289 A -i option can be used to set an output interval, and -T to include a timestamp. For example: # ./stackcount.py -P -Tdi 1 submit_bio Tracing 1 functions for "submit_bio"... Hit Ctrl-C to end. 06:05:13 06:05:14 submit_bio xfs_do_writepage write_cache_pages xfs_vm_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork -- kworker/u16:1 [15686] 1 submit_bio process_one_work worker_thread kthread ret_from_fork -- kworker/u16:0 [16007] 1 submit_bio xfs_buf_submit xlog_bdstrat xlog_sync xlog_state_release_iclog _xfs_log_force xfs_log_force xfs_fs_sync_fs sync_fs_one_sb iterate_supers sys_sync entry_SYSCALL_64_fastpath -- [unknown] sync [16039] 1 submit_bio submit_bh journal_submit_commit_record jbd2_journal_commit_transaction kjournald2 kthread ret_from_fork -- jbd2/xvda1-8 [405] 1 submit_bio process_one_work worker_thread kthread ret_from_fork -- kworker/0:2 [25482] 2 submit_bio ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork -- kworker/u16:0 [16007] 4 submit_bio xfs_vm_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork -- kworker/u16:1 [15686] 5 submit_bio __block_write_full_page block_write_full_page blkdev_writepage __writepage write_cache_pages generic_writepages blkdev_writepages do_writepages __filemap_fdatawrite_range filemap_fdatawrite fdatawrite_one_bdev iterate_bdevs sys_sync entry_SYSCALL_64_fastpath -- [unknown] sync [16039] 7 submit_bio submit_bh jbd2_journal_commit_transaction kjournald2 kthread ret_from_fork -- jbd2/xvda1-8 [405] 8 submit_bio ext4_bio_write_page mpage_submit_page mpage_map_and_submit_buffers ext4_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork -- kworker/u16:0 [16007] 8 submit_bio __block_write_full_page block_write_full_page blkdev_writepage __writepage write_cache_pages generic_writepages blkdev_writepages do_writepages __writeback_single_inode writeback_sb_inodes __writeback_inodes_wb wb_writeback wb_workfn process_one_work worker_thread kthread ret_from_fork -- kworker/u16:0 [16007] 60 06:05:15 06:05:16 Detaching... This only included output for the 06:05:14 interval. The other internals did not span block device I/O. The -s output prints the return instruction offset for each function (aka symbol offset). Eg: # ./stackcount.py -P -s tcp_sendmsg Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end. ^C tcp_sendmsg+0x1 sock_sendmsg+0x38 sock_write_iter+0x85 __vfs_write+0xe3 vfs_write+0xb8 SyS_write+0x55 entry_SYSCALL_64_fastpath+0x1e __write_nocancel+0x7 sshd [15015] 3 tcp_sendmsg+0x1 sock_sendmsg+0x38 sock_write_iter+0x85 __vfs_write+0xe3 vfs_write+0xb8 SyS_write+0x55 entry_SYSCALL_64_fastpath+0x1e __write_nocancel+0x7 sshd [8234] 3 Detaching... If it wasn't clear how one function called another, knowing the instruction offset can help you locate the lines of code from a disassembly dump. The -v output is verbose, and shows raw addresses: ./stackcount.py -P -v tcp_sendmsg Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end. ^C ffffffff817b05c1 tcp_sendmsg ffffffff8173ea48 sock_sendmsg ffffffff8173eae5 sock_write_iter ffffffff81232b33 __vfs_write ffffffff812331b8 vfs_write ffffffff81234625 SyS_write ffffffff818739bb entry_SYSCALL_64_fastpath 7f120511e6e0 __write_nocancel sshd [8234] 3 ffffffff817b05c1 tcp_sendmsg ffffffff8173ea48 sock_sendmsg ffffffff8173eae5 sock_write_iter ffffffff81232b33 __vfs_write ffffffff812331b8 vfs_write ffffffff81234625 SyS_write ffffffff818739bb entry_SYSCALL_64_fastpath 7f919c5a26e0 __write_nocancel sshd [15015] 11 Detaching... A wildcard can also be used. Eg, all functions beginning with "tcp_send", kernel stacks only (-K) with offsets (-s): # ./stackcount -Ks 'tcp_send*' Tracing 14 functions for "tcp_send*"... Hit Ctrl-C to end. ^C tcp_send_delayed_ack0x1 tcp_rcv_established0x3b1 tcp_v4_do_rcv0x130 tcp_v4_rcv0x8e0 ip_local_deliver_finish0x9f ip_local_deliver0x51 ip_rcv_finish0x8a ip_rcv0x29d __netif_receive_skb_core0x637 __netif_receive_skb0x18 netif_receive_skb_internal0x23 1 tcp_send_delayed_ack0x1 tcp_rcv_established0x222 tcp_v4_do_rcv0x130 tcp_v4_rcv0x8e0 ip_local_deliver_finish0x9f ip_local_deliver0x51 ip_rcv_finish0x8a ip_rcv0x29d __netif_receive_skb_core0x637 __netif_receive_skb0x18 netif_receive_skb_internal0x23 4 tcp_send_mss0x1 inet_sendmsg0x67 sock_sendmsg0x38 sock_write_iter0x78 __vfs_write0xaa vfs_write0xa9 sys_write0x46 entry_SYSCALL_64_fastpath0x16 7 tcp_sendmsg0x1 sock_sendmsg0x38 sock_write_iter0x78 __vfs_write0xaa vfs_write0xa9 sys_write0x46 entry_SYSCALL_64_fastpath0x16 7 Detaching... Use -r to allow regular expressions. The -f option will emit folded output, which can be used as input to other tools including flame graphs. For example, with delimiters as well: # ./stackcount.py -P -df t:sched:sched_switch ^Csnmp-pass;[unknown];[unknown];[unknown];[unknown];[unknown];-;entry_SYSCALL_64_fastpath;SyS_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule 1 kworker/7:0;-;ret_from_fork;kthread;worker_thread;schedule;__schedule 1 watchdog/0;-;ret_from_fork;kthread;smpboot_thread_fn;schedule;__schedule 1 snmp-pass;[unknown];[unknown];[unknown];[unknown];[unknown];-;entry_SYSCALL_64_fastpath;SyS_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule 1 svscan;[unknown];-;entry_SYSCALL_64_fastpath;SyS_nanosleep;hrtimer_nanosleep;do_nanosleep;schedule;__schedule 1 python;[unknown];__select_nocancel;-;entry_SYSCALL_64_fastpath;SyS_select;core_sys_select;do_select;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule 1 kworker/2:0;-;ret_from_fork;kthread;worker_thread;schedule;__schedule 1 [...] Flame graphs visualize stack traces. For information about them and links to open source software, see http://www.brendangregg.com/flamegraphs.html . This folded output can be piped directly into flamegraph.pl (the Perl version). USAGE message: # ./stackcount -h usage: stackcount [-h] [-p PID] [-c CPU] [-i INTERVAL] [-D DURATION] [-T] [-r] [-s] [-P] [-K] [-U] [-v] [-d] [-f] [--debug] pattern Count events and their stack traces positional arguments: pattern search expression for events optional arguments: -h, --help show this help message and exit -p PID, --pid PID trace this PID only -c CPU, --cpu CPU trace this CPU only -i INTERVAL, --interval INTERVAL summary interval, seconds -D DURATION, --duration DURATION total duration of trace, seconds -T, --timestamp include timestamp on output -r, --regexp use regular expressions. Default is "*" wildcards only. -s, --offset show address offsets -P, --perpid display stacks separately for each process -K, --kernel-stacks-only kernel stack only -U, --user-stacks-only user stack only -v, --verbose show raw addresses -d, --delimited insert delimiter between kernel/user stacks -f, --folded output folded format --debug print BPF program before starting (for debugging purposes) examples: ./stackcount submit_bio # count kernel stack traces for submit_bio ./stackcount -d ip_output # include a user/kernel stack delimiter ./stackcount -s ip_output # show symbol offsets ./stackcount -sv ip_output # show offsets and raw addresses (verbose) ./stackcount 'tcp_send*' # count stacks for funcs matching tcp_send* ./stackcount -r '^tcp_send.*' # same as above, using regular expressions ./stackcount -Ti 5 ip_output # output every 5 seconds, with timestamps ./stackcount -p 185 ip_output # count ip_output stacks for PID 185 only ./stackcount -p 185 c:malloc # count stacks for malloc in PID 185 ./stackcount t:sched:sched_fork # count stacks for sched_fork tracepoint ./stackcount -p 185 u:node:* # count stacks for all USDT probes in node ./stackcount -c 1 put_prev_entity # count put_prev_entity stacks for CPU 1 only ./stackcount -K t:sched:sched_switch # kernel stacks only ./stackcount -U t:sched:sched_switch # user stacks only