How eBPF user-space probes work

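User-space probes, or Uprobes for short, let you dynamically break into any routine of an application and collect debugging and performance information without disrupting it. There are currently two types of user-space probes: uprobes and uretprobes (also called return probes). A uprobe can be inserted on any instruction in the application's virtual address space, while a uretprobe fires when a user function returns.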

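The information needed to insert a uprobe, such as the target process, the probe location, and the probe handler, is specified through a registration function such as register_uprobe(). uprobes can be driven from kernel modules, eBPF events, and other mechanisms. Later sections try to summarize how eBPF uses uprobes; before that, let's look at how uprobes themselves work.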

How `uprobe` works

uprobe

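When a uprobe is registered, Uprobes makes a copy of the probed instruction, stops the probed application, replaces the first byte of the probed instruction with a breakpoint instruction (int3 on i386 and x86_64), and then lets the application continue. (When inserting the breakpoint, Uprobes uses the same copy-on-write mechanism as ptrace, so the breakpoint affects only that process and not other processes running the same program. This holds even when the probed instruction is in a shared library.)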

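When the CPU hits the breakpoint instruction, a trap occurs, the CPU's user-mode registers are saved, and a SIGTRAP signal is generated. Uprobes intercepts the SIGTRAP and finds the associated uprobe. It then calls the handler associated with that uprobe, passing the address of the uprobe struct and the address of the saved registers. The handler may block, but keep in mind that the probed thread stays stopped while the handler runs.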

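Next, Uprobes single-steps its copy of the probed instruction and then resumes the probed program at the instruction following the probepoint. (Single-stepping the original instruction in place would be simpler, but Uprobes would then have to remove the breakpoint instruction first. In a multithreaded application that opens a time window in which another thread could run past the probepoint without hitting it.)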

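The instruction copy that gets single-stepped is stored in a per-process "single-step out of line (SSOL) area", a small VM area that Uprobes creates in the address space of each probed process.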
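The register, hit, and single-step-out-of-line sequence described above can be sketched on a toy machine. This is only an illustration under loud assumptions: the one-byte instruction set, the `machine` and `probe` types, and every function name are made up for this sketch and have nothing to do with the kernel's data structures. The one thing it tries to show faithfully is that the breakpoint byte stays in the program text while the saved copy of the original instruction is executed from a separate slot.

```go
package main

// Toy one-byte opcodes for a made-up machine. This is not x86 encoding;
// only the value 0xCC is borrowed from the int3 opcode.
const (
	opHalt byte = 0x00 // stop execution
	opInc  byte = 0x01 // acc++
	opDbl  byte = 0x02 // acc *= 2
	opBrk  byte = 0xCC // breakpoint
)

// probe keeps the displaced instruction (the "out of line" copy)
// together with the user-supplied handler.
type probe struct {
	xol     byte
	handler func(pc, acc int)
}

// machine models one probed process: its text and its registered probes.
type machine struct {
	text   []byte
	probes map[int]*probe
}

func newMachine(text []byte) *machine {
	return &machine{text: append([]byte(nil), text...), probes: make(map[int]*probe)}
}

// register copies the instruction at addr, then overwrites it with a breakpoint.
func (m *machine) register(addr int, handler func(pc, acc int)) {
	m.probes[addr] = &probe{xol: m.text[addr], handler: handler}
	m.text[addr] = opBrk
}

// step applies one ordinary instruction to the accumulator.
func step(op byte, acc int) int {
	switch op {
	case opInc:
		return acc + 1
	case opDbl:
		return acc * 2
	}
	return acc
}

// run executes the text. On a breakpoint it calls the handler, then executes
// the saved copy instead of the patched byte and continues at the next
// instruction. The breakpoint itself is never removed.
func (m *machine) run() int {
	acc := 0
	for pc := 0; pc < len(m.text); pc++ {
		op := m.text[pc]
		if op == opBrk {
			p, ok := m.probes[pc]
			if !ok {
				break // stray breakpoint with no probe: stop
			}
			p.handler(pc, acc)
			op = p.xol
		}
		if op == opHalt {
			break
		}
		acc = step(op, acc)
	}
	return acc
}
```

Registering a probe on an instruction and calling `run()` should yield the same accumulator as an unprobed run, with the handler invoked once per hit and the text still holding `opBrk` at the probed address afterwards.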

utrace

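When the same application process carries multiple uprobes, Uprobes uses Utrace to establish a tracing "engine" for each thread in the process. Uprobes uses the Utrace "quiesce" mechanism to stop all threads before inserting or removing a breakpoint. Utrace also notifies Uprobes of breakpoint and single-step traps and of other interesting events in the probed process's lifetime (fork, clone, exec, exit). When a probe is registered or unregistered, the breakpoint is inserted or removed only after Utrace has stopped all threads in the process, and the register/unregister function returns only after the breakpoint has been inserted or removed. Note that this is the original Utrace-based design; Utrace was never merged into the mainline kernel, and mainline uprobes identify a probe by file inode and offset instead.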

uretprobe

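To use a uretprobe, call register_uretprobe(). Uprobes then establishes a uprobe at the entry to the function. When the probed function is called and this probe is hit, Uprobes saves a copy of the return address and replaces the return address with the address of a "trampoline" (a piece of code containing a breakpoint instruction). The trampoline is stored in the SSOL area.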

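When the probed function executes its return instruction, control passes to the trampoline and the breakpoint is hit. Uprobes' trampoline handler calls the handler associated with the uretprobe, sets the saved instruction pointer to the saved return address, and execution resumes there on return from the trap.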
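The return-address swap is easier to follow with a small model. The sketch below is a simplification under stated assumptions: the `thread` type, the `trampolineAddr` value, and the method names are all hypothetical, and the stack is reduced to a slice whose last element is the return-address slot. It only illustrates the bookkeeping: save the real return address on entry, let the return land on the trampoline, then restore the instruction pointer after the handler has run.

```go
package main

// trampolineAddr is a made-up address standing in for the trampoline
// that lives in the SSOL area.
const trampolineAddr uint64 = 0x7fff0000

// thread is a minimal model of one probed thread.
type thread struct {
	stack []uint64 // last element is the return-address slot at function entry
	ip    uint64   // instruction pointer
	saved []uint64 // real return addresses of in-flight probed calls
}

// onEntry models the entry uprobe of a uretprobe: remember the real return
// address and point the stack slot at the trampoline instead.
func (th *thread) onEntry() {
	top := len(th.stack) - 1
	th.saved = append(th.saved, th.stack[top])
	th.stack[top] = trampolineAddr
}

// ret models the function's return instruction: pop the slot into ip.
func (th *thread) ret() {
	top := len(th.stack) - 1
	th.ip = th.stack[top]
	th.stack = th.stack[:top]
}

// onTrampoline models the trampoline breakpoint handler: run the user's
// return handler, then resume at the real return address.
func (th *thread) onTrampoline(handler func()) {
	handler()
	last := len(th.saved) - 1
	th.ip = th.saved[last]
	th.saved = th.saved[:last]
}
```

Keeping the saved addresses in a stack rather than a single field is a design choice of this sketch so that nested or recursive calls of the probed function each get their own return address back.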

Multithreading support

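Uprobes supports probing multithreaded applications and places no limit on the number of threads in a probed application. All threads in a process share the same process resources (context), so every probe in the process affects all threads; each thread nevertheless hits the probepoint (and runs the handler) independently. Multiple threads may run the same handler simultaneously. If you want a particular thread or set of threads to run a particular handler, the handler should check current or current->pid to determine which thread hit the probepoint. When a process clones a new thread, that thread automatically shares all the probes established for the process.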
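The per-thread check can be illustrated from the eBPF side, where the closest counterpart to reading current->pid is the bpf_get_current_pid_tgid helper, which packs the thread group ID into the upper 32 bits and the kernel pid (the thread ID) into the lower 32 bits. The Go sketch below only models that decision in user space; `splitPidTgid` and `threadFilter` are hypothetical names introduced for this example, not part of any library.

```go
package main

// splitPidTgid unpacks a value laid out like the result of the
// bpf_get_current_pid_tgid helper: tgid (the process ID as seen from
// user space) in the upper 32 bits, thread ID in the lower 32 bits.
func splitPidTgid(v uint64) (tgid, tid uint32) {
	return uint32(v >> 32), uint32(v)
}

// threadFilter returns a predicate that reports whether a probe hit came
// from one of the given thread IDs. A handler would return early when the
// predicate is false.
func threadFilter(tids ...uint32) func(pidTgid uint64) bool {
	want := make(map[uint32]struct{}, len(tids))
	for _, tid := range tids {
		want[tid] = struct{}{}
	}
	return func(pidTgid uint64) bool {
		_, tid := splitPidTgid(pidTgid)
		_, ok := want[tid]
		return ok
	}
}
```

For a hit reported as `uint64(1000)<<32 | 1003`, a filter built with `threadFilter(1003)` accepts it, while `threadFilter(1001)` rejects it even though both threads belong to process 1000.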

uprobe and eBPF

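uprobe is a kernel framework for collecting runtime information from user-space programs. Previously, using it meant writing a kernel module (mainly the handler, i.e. the callback function). eBPF has since redefined how kernel-side development is done, so this section tries to work out how to use uprobes from eBPF to dynamically trace a user process and collect information.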

How eBPF listens for uprobe events

eBPF uses hooks to filter and identify events of interest, such as system calls. A simplified view of how a function-entry hook works:

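When functions are compiled with a 5-byte call placeholder at their entry (a disassembler shows it as callq function+0x5), rewriting those first 5 bytes into jmp Hook_ptr turns the function entry into an event trigger point. (Hook_ptr is not the address of the hook function itself; the displaced bytes and stack balance have to be accounted for.)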

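For uprobes the same idea is realized with a breakpoint: replace the function's entry instruction with a breakpoint instruction (int3), have the breakpoint handler call the custom listener, and then execute the original instruction so the function carries on. That is all it takes for eBPF to listen for uprobe events.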

How eBPF sets up a uprobe

The following tool attaches a uprobe that traces goroutine creation:

package main

import (
	"encoding/binary"
	"flag"
	"fmt"
	"os"
	"os/signal"

	"github.com/iovisor/gobpf/bcc"
)

const bpfProgram = `
#include <uapi/linux/ptrace.h>

BPF_PERF_OUTPUT(trace);

typedef struct {
	int num;
	long fn_ptr;
}newproc_args;

// This function will be registered to be called every time
// runtime.newproc is called.
inline int newprocCalled(struct pt_regs *ctx) {
  // function address
  long val = ctx->ax;
  trace.perf_submit(ctx, &val, sizeof(val));

  /*
  void* stackAddr = (void*)ctx->sp;
  newproc_args event = {};
  bpf_probe_read(&event.num, sizeof(event.num), stackAddr+8);
  bpf_probe_read(&event.fn_ptr, sizeof(event.fn_ptr), stackAddr+16);

  trace.perf_submit(ctx, &event, sizeof(event));
  */

  return 0;
}
`

var binaryProg string

func init() {
	flag.StringVar(&binaryProg, "binary", "", "The binary to probe")
}

func main() {
	flag.Parse()
	if len(binaryProg) == 0 {
		panic("Argument --binary needs to be specified")
	}

	bccMod := bcc.NewModule(bpfProgram, []string{})
	uprobeFD, err := bccMod.LoadUprobe("newprocCalled")
	if err != nil {
		panic(err)
	}

	// Attach the uprobe to be called every time runtime.newproc is called.
	// We need to specify the path to the binary so it can be patched.
	err = bccMod.AttachUprobe(binaryProg, "runtime.newproc", uprobeFD, -1)
	if err != nil {
		panic(err)
	}

	// Create the output table named "trace" that the BPF program writes to.
	table := bcc.NewTable(bccMod.TableId("trace"), bccMod)
	ch := make(chan []byte)

	pm, err := bcc.InitPerfMap(table, ch, nil)
	if err != nil {
		panic(err)
	}

	// Watch Ctrl-C so we can quit this program.
	intCh := make(chan os.Signal, 1)
	signal.Notify(intCh, os.Interrupt)

	pm.Start()
	defer pm.Stop()

	for {
		select {
		case <-intCh:
			fmt.Println("Terminating")
			// Return instead of os.Exit so the deferred pm.Stop() runs.
			return
		case v := <-ch:
			// This is a bit of a hack, but we know the BPF program submits a
			// single 8-byte value (ctx->ax), so decode it as a little-endian uint64.
			fmt.Println("get perf event ", v)
			d := binary.LittleEndian.Uint64(v)
			fmt.Printf("Value = %x\n", d)
		}
	}
}
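If the commented-out `newproc_args` path in the BPF program were enabled, the perf event would carry a struct rather than a single 8-byte value. The sketch below shows one way the Go side might decode it. It assumes an x86_64 target, where `int` is 4 bytes, `long` is 8 bytes, and the compiler inserts 4 bytes of padding between the two fields, giving a 16-byte little-endian record; `newprocArgs` and `decodeNewprocArgs` are names invented for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// newprocArgs mirrors the C struct newproc_args { int num; long fn_ptr; }.
type newprocArgs struct {
	Num   int32
	FnPtr int64
}

// newprocArgsSize is the assumed C layout on x86_64:
// 4 bytes num, 4 bytes padding, 8 bytes fn_ptr.
const newprocArgsSize = 16

// decodeNewprocArgs parses one raw perf event payload. Extra trailing
// bytes are ignored; a short payload is reported as an error.
func decodeNewprocArgs(raw []byte) (newprocArgs, error) {
	if len(raw) < newprocArgsSize {
		return newprocArgs{}, errors.New("perf event shorter than newproc_args")
	}
	return newprocArgs{
		Num:   int32(binary.LittleEndian.Uint32(raw[0:4])),
		FnPtr: int64(binary.LittleEndian.Uint64(raw[8:16])),
	}, nil
}
```

In the event loop above, `decodeNewprocArgs(v)` would take the place of the `binary.LittleEndian.Uint64(v)` call, and the length check avoids a panic on an unexpectedly short payload.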

The overall flow is:

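1. Write the probe handler, i.e. the BPF program, which extracts the data of interest from the event and submits it to a data channel (for example a perf buffer).
2. Load the BPF program and attach it with attach_uprobe, specifying the function symbol of interest (e.g. runtime.newproc) and the handler (e.g. newprocCalled).
3. Listen on the perf buffer for the resulting event output.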

The attach_uprobe code below shows how this works in detail:

StatusTuple BPF::attach_uprobe(const std::string& binary_path,
                               const std::string& symbol,
                               const std::string& probe_func,
                               uint64_t symbol_addr,
                               bpf_probe_attach_type attach_type, pid_t pid,
                               uint64_t symbol_offset,
                               uint32_t ref_ctr_offset) {

  if (symbol_addr != 0 && symbol_offset != 0)
    return StatusTuple(-1,
             "Attachng uprobe with addr %lx and offset %lx is not supported",
             symbol_addr, symbol_offset);

  std::string module;
  uint64_t offset;
  TRY2(check_binary_symbol(binary_path, symbol, symbol_addr, module, offset,
                           symbol_offset));

  std::string probe_event = get_uprobe_event(module, offset, attach_type, pid);
  if (uprobes_.find(probe_event) != uprobes_.end())
    return StatusTuple(-1, "uprobe %s already attached", probe_event.c_str());

  int probe_fd;
  TRY2(load_func(probe_func, BPF_PROG_TYPE_KPROBE, probe_fd));

  int res_fd = bpf_attach_uprobe(probe_fd, attach_type, probe_event.c_str(),
                                 binary_path.c_str(), offset, pid,
                                 ref_ctr_offset);

  if (res_fd < 0) {
    TRY2(unload_func(probe_func));
    return StatusTuple(
        -1,
        "Unable to attach %suprobe for binary %s symbol %s addr %lx "
        "offset %lx using %s\n",
        attach_type_debug(attach_type).c_str(), binary_path.c_str(),
        symbol.c_str(), symbol_addr, symbol_offset, probe_func.c_str());
  }

  open_probe_t p = {};
  p.perf_event_fd = res_fd;
  p.func = probe_func;
  uprobes_[probe_event] = std::move(p);
  return StatusTuple::OK();
}
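The bookkeeping in attach_uprobe (derive an event name, refuse duplicates, record the file descriptor only on success) can be restated compactly. The Go sketch below is a model, not BCC's code: the `registry` type and its methods are hypothetical, the `eventName` format is only an approximation of what get_uprobe_event produces, and the actual attach step is passed in as a callback that returns a negative value on failure, mirroring the `res_fd < 0` check.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitize replaces every character that is not a letter, digit, or
// underscore with an underscore so the path can be used in an event name.
func sanitize(path string) string {
	return strings.Map(func(r rune) rune {
		if (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') || r == '_' {
			return r
		}
		return '_'
	}, path)
}

// eventName builds a key for an entry probe. The format is an assumption
// made for this sketch.
func eventName(path string, offset uint64, pid int) string {
	name := fmt.Sprintf("p_%s_0x%x", sanitize(path), offset)
	if pid != -1 {
		name += fmt.Sprintf("_%d", pid)
	}
	return name
}

// registry tracks attached probes by event name, like the uprobes_ map.
type registry struct {
	open map[string]int // event name -> perf event fd
}

func newRegistry() *registry {
	return &registry{open: make(map[string]int)}
}

// attach rejects duplicates, calls do to perform the real attach, and
// records the returned fd only when it is non-negative.
func (r *registry) attach(path string, offset uint64, pid int, do func(event string) int) error {
	event := eventName(path, offset, pid)
	if _, dup := r.open[event]; dup {
		return fmt.Errorf("uprobe %s already attached", event)
	}
	fd := do(event)
	if fd < 0 {
		return fmt.Errorf("unable to attach uprobe %s", event)
	}
	r.open[event] = fd
	return nil
}
```

Because a failed attach leaves the map untouched, a later retry for the same binary and offset is not mistaken for a duplicate.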

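bpf_attach_uprobe is built on bpf_attach_probe, which first tries to create the probe through perf_event_open and, if that fails, falls back to configuring the uprobe by writing to the tracing debugfs interface, as the code below shows: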

// config1 could be either kprobe_func or uprobe_path,
// see bpf_try_perf_event_open_with_probe().
static int bpf_attach_probe(int progfd, enum bpf_probe_attach_type attach_type,
                            const char *ev_name, const char *config1, const char* event_type,
                            uint64_t offset, pid_t pid, int maxactive,
                            uint32_t ref_ctr_offset)
{
  int kfd, pfd = -1;
  char buf[PATH_MAX], fname[256];
  bool is_kprobe = strncmp("kprobe", event_type, 6) == 0;

  if (maxactive <= 0)
    // Try create the [k,u]probe Perf Event with perf_event_open API.
    pfd = bpf_try_perf_event_open_with_probe(config1, offset, pid, event_type,
                                             attach_type != BPF_PROBE_ENTRY,
                                             ref_ctr_offset);

  // If failed, most likely Kernel doesn't support the perf_kprobe PMU
  // (e12f03d "perf/core: Implement the 'perf_kprobe' PMU") yet.
  // Try create the event using debugfs.
  if (pfd < 0) {
    if (create_probe_event(buf, ev_name, attach_type, config1, offset,
                           event_type, pid, maxactive) < 0)
      goto error;

    // If we're using maxactive, we need to check that the event was created
    // under the expected name.  If debugfs doesn't support maxactive yet
    // (kernel < 4.12), the event is created under a different name; we need to
    // delete that event and start again without maxactive.
    if (is_kprobe && maxactive > 0 && attach_type == BPF_PROBE_RETURN) {
      if (snprintf(fname, sizeof(fname), "%s/id", buf) >= sizeof(fname)) {
        fprintf(stderr, "filename (%s) is too long for buffer\n", buf);
        goto error;
      }
      if (access(fname, F_OK) == -1) {
        // Deleting kprobe event with incorrect name.
        kfd = open("/sys/kernel/debug/tracing/kprobe_events",
                   O_WRONLY | O_APPEND, 0);
        if (kfd < 0) {
          fprintf(stderr, "open(/sys/kernel/debug/tracing/kprobe_events): %s\n",
                  strerror(errno));
          return -1;
        }
        snprintf(fname, sizeof(fname), "-:kprobes/%s_0", ev_name);
        if (write(kfd, fname, strlen(fname)) < 0) {
          if (errno == ENOENT)
            fprintf(stderr, "cannot detach kprobe, probe entry may not exist\n");
          else
            fprintf(stderr, "cannot detach kprobe, %s\n", strerror(errno));
          close(kfd);
          goto error;
        }
        close(kfd);

        // Re-creating kprobe event without maxactive.
        if (create_probe_event(buf, ev_name, attach_type, config1,
                               offset, event_type, pid, 0) < 0)
          goto error;
      }
    }
  }
  // If perf_event_open succeeded, bpf_attach_tracing_event will use the created
  // Perf Event FD directly and buf would be empty and unused.
  // Otherwise it will read the event ID from the path in buf, create the
  // Perf Event event using that ID, and updated value of pfd.
  if (bpf_attach_tracing_event(progfd, buf, pid, &pfd) == 0)
    return pfd;

error:
  bpf_close_perf_event_fd(pfd);
  return -1;
}
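Two pieces of this function lend themselves to a small illustration: the try-perf-first, fall-back-to-debugfs control flow, and the text that the fallback writes. The Go sketch below assumes the command syntax of the kernel's uprobe_events tracing file (`p:GROUP/EVENT PATH:OFFSET` to add an entry probe, `r:` for a return probe, `-:GROUP/EVENT` to delete one). All function names are hypothetical, and the two attach paths are passed in as callbacks that return a negative fd on failure, echoing the `pfd < 0` convention of the C code.

```go
package main

import "fmt"

// uprobeEventLine formats the command that defines a uprobe (or a
// uretprobe when isReturn is true) in the "uprobes" group.
func uprobeEventLine(isReturn bool, event, path string, offset uint64) string {
	kind := "p"
	if isReturn {
		kind = "r"
	}
	return fmt.Sprintf("%s:uprobes/%s %s:0x%x", kind, event, path, offset)
}

// uprobeDeleteLine formats the command that removes a previously
// defined event from the "uprobes" group.
func uprobeDeleteLine(event string) string {
	return "-:uprobes/" + event
}

// attachWithFallback tries the perf_event_open path first and only calls
// the debugfs path when the first one fails. It reports which path
// produced the fd; on total failure it returns -1 and an empty method.
func attachWithFallback(viaPerf, viaDebugfs func() int) (fd int, method string) {
	if fd = viaPerf(); fd >= 0 {
		return fd, "perf_event_open"
	}
	if fd = viaDebugfs(); fd >= 0 {
		return fd, "debugfs"
	}
	return -1, ""
}
```

For example, `uprobeEventLine(false, "newproc", "/tmp/app", 0x4a0)` yields the line `p:uprobes/newproc /tmp/app:0x4a0`; the event name, path, and offset here are placeholders chosen for the example.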
