关于 golang 错误处理的一些优化想法

时间：2023-12-02 本站点击：0

0. 前言

由于笔者水平有限，文章中难免出现各种错误，欢迎吐槽。

由于篇幅所限，大部分代码细节并未讲的很清楚，如有疑问欢迎讨论。

1. 当前存在的问题

1.1 标准库的 error 信息量少

golang 自带的 errors.New() 和 fmt.Errorf() 都只能记录 string 信息，错误现场只能靠 log 提供，如果需要获取调用栈，需要额外的代码逻辑。即使后来增加了 "%w" 的支持也没有本质的改变。

令人欣喜的是，pkg/errors 补齐了这个短板，其 WithStack() 提供了完整的调用栈信息，且 Wrap() 也提供了更详细的 error 路径追踪能力。

1.2 pkg/errors 好用，但性能较差

我们简单给 pkg/errors 做个基准测试：

package errorsimport (    "errors"    "testing"    pkgerrs "github.com/pkg/errors")func deepCall(depth int, f func()) {    if depth <= 0 {        f()        return    }    deepCall(depth-1, f)}func BenchmarkPkg(b *testing.B) {    b.Run("pkg/errors", func(b *testing.B) {        b.ReportAllocs()        var err error        deepCall(10, func() {            for i := 0; i < b.N; i++ {                err = pkgerrs.New("ye error")                GlobalE = fmt.Sprintf("%+v", err)            }            b.StopTimer()        })    })    b.Run("errors-fmt", func(b *testing.B) {        b.ReportAllocs()        var err error        deepCall(10, func() {            for i := 0; i < b.N; i++ {                err = errors.New("ye error")                GlobalE = fmt.Sprintf("%+v", err)            }            b.StopTimer()        })    })    b.Run("errors-Errors", func(b *testing.B) {        b.ReportAllocs()        var err error        deepCall(10, func() {            for i := 0; i < b.N; i++ {                err = errors.New("ye error")                GlobalE = err.Error()            }            b.StopTimer()        })    })}/*结果如下：cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkPkgBenchmarkPkg/pkg/errorsBenchmarkPkg/pkg/errors-12    105876    10353 ns/op    2354 B/op    36 allocs/opBenchmarkPkg/errors-fmtBenchmarkPkg/errors-fmt-12    8025952    154.4 ns/op    24 B/op    2 allocs/opBenchmarkPkg/errors-ErrorsBenchmarkPkg/errors-Errors-12    42631442    26.23 ns/op    16 B/op    1 allocs/op*/

由基准测试结果可知，仅 10 个调用层级 pkg/errors 就需要损耗近 10us; 而 errors.New() 在同等条件下只损耗 150ns，如果用 Error() 输出甚至只要不到 30ns，差距巨大。这也是很多 gopher 犹豫是否用 pkg/errors 替换 errors.New() 和 fmt.Errorf() 的原因。

1.3 pkg/errors 信息冗余较严重

WithStack() 的冗余信息应该不算多，不过 Wrap() 接口还带上完整的调用栈就显得没那么必要了，因为大部分和 WithStack() 其实是重复的，而且如果多调用几次 Wrap() 很容易会造成日志超长。

1.4 panic(err) 方便，但有可能会造成程序退出

很多同学受不了 if err!=nil{ ... return } 的错误处理方式，转而使用更方便的 defer...panic(err) 方式。这种方式虽然可以方便的处理错误，不过也会带来致命的问题----如果忘记 defer recover() 将会造成整个程序的退出。也是因此，很多团队把 defer...panic(err) 方式列为了红线，严禁触碰。

2. 有什么改进想法？

可以看到 pkg/errors 实际上是使用 runtime.Callers 获取完整调用栈的，而该函数实际性能比较差，简单基准测试如下：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/

由测试结果可知，其获取 stack 的损耗接近 5us。

笔者之前的文章里也有涉及 runtime.Callers 的优化，该文章虽然仅优化了获取当前代码行的性能，不过其思路依然可以作为此次优化的参考。

2.1 通过缓存提升 pc 到代码行号的性能

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}

做个简单的基准测试：

func BenchmarkCacheStack(b *testing.B) {    b.Run("NewStack", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                NewStack(0)            }        })    })    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*结果：cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkCacheStackBenchmarkCacheStack/NewStackBenchmarkCacheStack/NewStack-12    1285654    983.3 ns/op    0 B/op    0 allocs/opBenchmarkCacheStack/runtime.CallersBenchmarkCacheStack/runtime.Callers-12    272685    4435 ns/op    1820 B/op    24 allocs/op*/

从测试结果来看，缓存的效果还是很明显的，由 4000ns 多提升到 1000ns 左右。

2.2 从调用栈中获取 pc 列表

我们来先看一下 golang 的调用栈,下图出自曹春晖老师的github文章 :

                                       caller                                                                                                                  +------------------+                                                                                                          |                  |                                                                                +---------------------->  --------------------                                                                                |                         |                  |                                                                                |                         | caller parent BP |                                                                                |           BP(pseudo SP) --------------------                                                                                |                         |                  |                                                                                |                         |   Local Var0     |                                                                                |                         --------------------                                                                                |                         |                  |                                                                                |                         |   .......        |                                                                                |                         --------------------                                                                                |                         |                  |                                                                                |                         |   Local VarN     |                                                                                                          --------------------                                                                          caller stack frame              |                  |                                                                                                          |   callee arg2    |                                                                                |                         |------------------|                                                                                |                         |                  |                                                                                |                         |   callee arg1    |                                                                                |                         |------------------|                                                                                |                         |                  |                                                                                |                         |   callee arg0    |                                                                                |                         ----------------------------------------------+   FP(virtual register)                              |                         |                  |                          |                                                     |                         |   return addr    |  parent return address   |                                                     +---------------------->  +------------------+---------------------------    <-------------------------------+                                                             |  caller BP               |                                    |                                                             |  (caller frame pointer)  |                                    |                                              BP(pseudo SP)  ----------------------------                                    |                                                             |                          |                                    |                                                             |     Local Var0           |                                    |                                                             ----------------------------                                    |                                                             |                          |                                                                                                  |     Local Var1           |                                                                                                  ----------------------------                            callee stack frame                                                    |                          |                                                                                                  |       .....              |                                                                                                  ----------------------------                                    |                                                             |                          |                                    |                                                             |     Local VarN           |                                    |                                           SP(Real Register) ----------------------------                                    |                                                             |                          |                                    |                                                             |                          |                                    |                                                             |                          |                                    |                                                             |                          |                                    |                                                             |                          |                                    |                                                             +--------------------------+    <-------------------------------+                                                                       callee

我们看到调用栈中的两个关键信息----"parent return address" 和 "caller BP(caller frame pointer)"，分别表示当前函数的栈帧的返回 pc 和栈基地址。

通过这两者的配合，我们就可以获得较完整的调用栈：

// func buildStack(s []uintptr) intTEXT ·buildStack(SB), NOSPLIT, $24-8    NO_LOCAL_POINTERS    MOVQ     cap+16(FP), DX     // s.cap    MOVQ     p+0(FP), AX        // s.ptr    MOVQ    $0, CX            // loop.i = 0loop:    CMPQ    CX, DX            // if i >= s.cap { return }    JAE    return                // 无符号大于等于就跳转    MOVQ    +8(BP), BX        // last pc -> BX    MOVQ    BX, 0(AX)(CX*8)        // s[i] = BX    ADDQ    $1, CX            // CX++ / i++    MOVQ    +0(BP), BP         // last BP; 展开调用栈至上一层    CMPQ    BP, $0             // if (BP) <= 0 { return }    JA loop                    // 无符号大于就跳转return:    MOVQ    CX,n+24(FP)     // ret n    RET

但是，我们知道，golang 为了提高函数调用的性能，会对符合条件的小函数进行内联。所以，我们通过上面的 buildStack 函数，实际上是无法获取完整的调用栈的。

我们可以在 golang 的源码中找到展开内联函数获取 pc 的代码：

// If there is inlining info, record the inner frames.if inldata := funcdata(f, _FUNCDATA_InlTree); inldata != nil {    inltree := (*[1 << 20]inlinedCall)(inldata)    for {        ix := pcdatavalue(f, _PCDATA_InlTreeIndex, tracepc, &cache)        if ix < 0 {            break        }        if inltree[ix].funcID == funcID_wrapper && elideWrapperCalling(lastFuncID) {            // ignore wrappers        } else if skip > 0 {            skip--        } else if n < max {            (*[1 << 20]uintptr)(unsafe.Pointer(pcbuf))[n] = pc            n++        }        lastFuncID = inltree[ix].funcID        // Back up to an instruction in the "caller".        tracepc = frame.fn.entry() + uintptr(inltree[ix].parentPc)        pc = tracepc + 1    }}

可以看到，它是在 _FUNCDATA_InlTree 中查找内联函数，并补充到 pc 列表中。我们也可以抄它，不过没必要，显然 buildStack() 拿到的 pc 列表和 runtime.Callers() 拿到的 pc 列表肯定是一一对应的（如果有不同看法，欢迎来讨论），所以我们可以利用缓存就能达到这个目的。

利用缓存提升性能的代码如下:

func NewStack2(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := buildStack(pcs[:])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        pcs1 := make([]uintptr, DefaultDepth)        npc1 := runtime.Callers(2, pcs1[:DefaultDepth])        stack = parseSlow(pcs1[:npc1])        cacheStack.Set(*pcs, stack)    }    if len(stack)<skip{        stack=nil    }else{        stack = stack[skip:]    }    pool.Put(pcs)    return}

再做个简单的基准测试：

func BenchmarkCacheStack(b *testing.B) {    b.Run("NewStack", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                NewStack(0)            }        })    })    b.Run("NewStack2", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                NewStack2(0)            }        })    })    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*结果：cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkCacheStackBenchmarkCacheStack/NewStackBenchmarkCacheStack/NewStack-12    1287921    932.1 ns/op    0 B/op    0 allocs/opBenchmarkCacheStack/NewStack2BenchmarkCacheStack/NewStack2-12    23890822    49.06 ns/op    0 B/op    0 allocs/opBenchmarkCacheStack/runtime.CallersBenchmarkCacheStack/runtime.Callers-12    263892    4285 ns/op    1820 B/op    24 allocs/op*/

基准测试的结果还是让人满意的，获取调用栈时间损耗从接近 4000ns 优化到了 50ns 左右了。

至此，提高获取调用栈的性能的目标我们已经达到了。

不过，这种优化方式也是有缺点的。就是强依赖调用栈上的 BP 信息，而由 go的下面这段代码可知，golang 的 BP 信息只在 AMD64 和 ARM64 平台上才有。实际上在其他平台上，我们也可以通过 "debug/dwarf" 包读取 golang 的 DWARF 信息，从而解析出 pc 到栈帧的映射，不过如果这样做的话，性能可能和 runtime.Callers() 就没多大差别了。

// For architectures with frame pointers, if there's// a frame, then there's a saved frame pointer here.//// NOTE: This code is not as general as it looks.// On x86, the ABI is to save the frame pointer word at the// top of the stack frame, so we have to back down over it.// On arm64, the frame pointer should be at the bottom of// the stack (with R29 (aka FP) = RSP), in which case we would// not want to do the subtraction here. But we started out without// any frame pointer, and when we wanted to add it, we didn't// want to break all the assembly doing direct writes to 8(RSP)// to set the first parameter to a called function.// So we decided to write the FP link *below* the stack pointer// (with R29 = RSP - 8 in Go functions).// This is technically ABI-compatible but not standard.// And it happens to end up mimicking the x86 layout.// Other architectures may make different decisions.if frame.varp > frame.sp && framepointer_enabled {    frame.varp -= goarch.PtrSize}//  src/runtime/runtime2.go// Must agree with internal/buildcfg.Experiment.FramePointer.const framepointer_enabled = GOARCH == "amd64" || GOARCH == "arm64"

2.3 减少 stack 冗余信息，并修改日志结构

针对 WithStack()，可以压缩一下 stack 信息。

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/0

对比一下 pkg/errors 的打印信息，可以有个比较直观的感受：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/1

针对 Wrap()，可以只保留一行源码信息即可。

比如：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/2

以上代码调用栈会格式化成这样：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/3

当然，这个日志结构可以自己定义，关键是要减少无效、重复的信息量。

2.4 改进 error -> string 的性能

我们先简单测试一下 pkg/errors 生成 string 的性能：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/4

基准测试结果表明，pkg/errors 打印错误栈也造成了巨大的损耗，8us 多。因此，还是得想办法改善一下。

笔者知道的改善序列化性能的途径只有两个，一个是提高序列化速度，一个是减少内存复制次数。显然在拼接 string 的时候，前者在无法再优化，只能在后者上想想办法。

于是笔者尝试了“引入 buffer”和“提前计算 string 长度”两种方式后，选择了后者。

其实现如下：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/5

由结果来看，当前场景下，性能比 bytes.NewBuffer 方式快了不少，令人意外。

2.5 改进 panic(err)

笔者的想法是，在 panic(err) 之前检查一下该 goroutine 是否已执行过 defer Catch()。如果未执行过，则不执行 panic(err)。

代码如下：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/6

原理很简单，就是在执行 defer func(){} 前，先执行 NewGuard()。在 NewGuard() 中给当前 goroutine 打上一个标签。在 Try(err) 函数里， panic(err) 前检查该 goroutine 是否已打标签。如果未打标签则打印错误信息，并退出当前 goroutine。这比因 panic 而导致整个程序退出造成的后果要轻的多。

3. 不成功的探索

3.1 长跳转提前返回

笔者曾经想实现这样一个接口：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/7

显然，这个接口参考了 go2.0 的错误处理方式：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/8

实现代码如下：

func BenchmarkStack(b *testing.B) {    b.Run("runtime.Callers", func(b *testing.B) {        deepCall(10, func() {            b.ResetTimer()            for i := 0; i < b.N; i++ {                pcs := pool.Get().(*[DefaultDepth]uintptr)                n := runtime.Callers(2, pcs[:DefaultDepth])                var cs []string                traces, more, f := runtime.CallersFrames(pcs[:n]), true, runtime.Frame{}                for more {                    f, more = traces.Next()                    cs = append(cs, f.File+":"+strconv.Itoa(f.Line))                }                pool.Put(pcs)            }        })    })}/*cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHzBenchmarkStackBenchmarkStack/runtime.CallersBenchmarkStack/runtime.Callers-12    252842    4947 ns/op    1820 B/op    24 allocs/op*/9

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}0

这里的实现原理也很简单，就是在 NewTag() 函数里，把 NewTag() 函数的返回地址(pc)记录到 tag.pc 中。后续执行 tag.Try(err) 时，如果 err!=nil 就将其返回地址替换为 tag.pc，这样就跳转到了 NewTag() 的返回处继续执行。由于此时 err1!=nil，自然就进入 if 逻辑并执行 return 让函数提前返回。

但是，这样虽然能保证按预计逻辑执行，却有一个比较致命的问题，就是 debug 模式和 release 模式下，defer 函数的执行预期不一致。

比如下面的例子：

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}1

debug 模式下打印 "12"，relese 模式下只打印 "1"。原因是 debug 模式下， golang 编译器没有对 defer 函数做优化，会在执行到 defer 时，将 defer 函数挂到 g._defer 链表上。而 release 模式下，golang 编译器会将 defer 函数内联到 return（即汇编的 RET）命令前。我们通过汇编将 tag.Try(err) 的返回地址替换为 NewTag() 的返回地址，却并没有将 NewTag() 之后的 defer 函数添加到 return 命令前，所以 NewTag() 之后的 defer 函数并能被执行。

不过，也有办法处理，就是比较”反人类“。像这样：

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}2

上面代码的做法，就是把函数体用 goto tag 包裹起来，这样 golang 编译器就会取消 defer 的内联优化，把 defer 函数挂到 g._defer 链表上，从而使代码在 release 和 debug 下得到一致结果。如果这样做，这个接口就太不友善了，所以这只是一个“失败的探索”

3.2 检查 g._defer 链表

同样的原因导致的“失败探索”，还有下面这个例子:

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}3

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}4

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}5

以上代码，本意是想通过 g._defer 链表，检查 panic(err) 前是否有执行过 defer Catch() （检查代码未写出）。不过也是由于 golang 编译器对 defer 的优化，导致无法简单的获取到正确的 g._defer 链表。

4. 附录

根据以上优化想法，笔者写了一个 errors 库 lxt1045/errors，当前还只是一个玩具，欢迎来吐槽。

先来看一下调用栈输出格式：

package errorsimport (    "runtime"    "strconv"    "sync"    "unsafe"    "github.com/lxt1045/errors")const DefaultDepth = 32var (    cacheStack = errors.AtomicCache[[DefaultDepth]uintptr{}, []string]{}    pool       = sync.Pool{        New: func() any { return &[DefaultDepth]uintptr{} },    })func NewStack(skip int) (stack []string) {    pcs := pool.Get().(*[DefaultDepth]uintptr)    n := runtime.Callers(skip+2, pcs[:DefaultDepth-skip])    for i := n; i < DefaultDepth; i++ {        pcs[i] = 0    }    stack = cacheStack.Get(*pcs)    if len(stack) == 0 {        stack = parseSlow(pcs[:n])        cacheStack.Set(*pcs, stack)    }    pool.Put(pcs)    return}func parseSlow(pcs []uintptr) (cs []string) {    traces, more, f := runtime.CallersFrames(pcs), true, runtime.Frame{}    for more {        f, more = traces.Next()        cs = append(cs, f.File+":"+strconv.Itoa(f.Line))    }    return}6

可以看到，相对于 pkg/errors 来说，日志格式还是有所改善的。