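If you read the previous article, "A Guide to Using Apache Arrow for Go Developers: Data Types"[1], and its many Go code examples that work with Arrow, you were probably confused by how often the code calls the Retain and Release methods. I had the same reaction: **Go is a garbage-collected language[2], so why does it need a separate set of Retain and Release methods to manage memory?**
In this article we look for the answer to that question, see how to use Retain and Release, and along the way learn how the Go implementation of Apache Arrow works.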
Note: this article is based on the code of Apache Arrow Go v13 (the module version declared in go.mod is v13).
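Readers who went through the code in the first article may have noticed that, whether for a primitive array type or a nested one such as the List array type, an array is always created through the same builder-based routine.
This builder pattern is said to be modeled on Arrow's C++ implementation. The figure below shows how the types involved in the Go builder pattern relate to one another: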
[Figure: relationships among the types in Go Arrow's builder pattern]
The figure can also serve as a rough schematic of how the Go Arrow implementation works.
From the figure, we can see:
// github.com/apache/arrow/go/arrow/array/array.go
type array struct {
	refCount        int64
	data            *Data
	nullBitmapBytes []byte
}

// Retain increases the reference count by 1.
// Retain may be called simultaneously from multiple goroutines.
func (a *array) Retain() {
	atomic.AddInt64(&a.refCount, 1)
}

// Release decreases the reference count by 1.
// Release may be called simultaneously from multiple goroutines.
// When the reference count goes to zero, the memory is freed.
func (a *array) Release() {
	debug.Assert(atomic.LoadInt64(&a.refCount) > 0, "too many releases")
	if atomic.AddInt64(&a.refCount, -1) == 0 {
		a.data.Release()
		a.data, a.nullBitmapBytes = nil, nil
	}
}
Take the Int64 array type as an example:
// github.com/apache/arrow/go/arrow/array/numeric.gen.go
// A type which represents an immutable sequence of int64 values.
type Int64 struct {
	array  // embedded, so Int64 "inherits" array's Retain and Release methods
	values []int64
}
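To make the "inheritance" concrete, here is a minimal standard-library sketch of Go struct embedding. The names `base`, `int64Array`, `releaser`, `newInt64Array` and `Refs` are made up for illustration and are not Arrow types; the point is only that methods of an embedded struct are promoted to the outer type.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// base is a hypothetical stand-in for Arrow's unexported array struct.
type base struct {
	refCount int64
}

// Retain increases the reference count by 1.
func (b *base) Retain() { atomic.AddInt64(&b.refCount, 1) }

// Release decreases the reference count by 1.
func (b *base) Release() { atomic.AddInt64(&b.refCount, -1) }

// Refs reports the current count (added here for illustration only).
func (b *base) Refs() int64 { return atomic.LoadInt64(&b.refCount) }

// int64Array is a hypothetical stand-in for array.Int64. It embeds base,
// so base's methods are promoted to *int64Array.
type int64Array struct {
	base
	values []int64
}

// releaser is satisfied by *int64Array purely through the promoted methods.
type releaser interface {
	Retain()
	Release()
}

func newInt64Array(values []int64) *int64Array {
	return &int64Array{base: base{refCount: 1}, values: values}
}

func main() {
	arr := newInt64Array([]int64{1, 2, 3})
	var r releaser = arr // compiles only because Retain/Release are promoted
	r.Retain()
	fmt.Println(arr.Refs()) // 2
	r.Release()
	fmt.Println(arr.Refs()) // 1
}
```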
// reuse_string_builder.go
package main

import (
	"encoding/hex"
	"fmt"

	"github.com/apache/arrow/go/v13/arrow/array"
	"github.com/apache/arrow/go/v13/arrow/memory"
)

func main() {
	bldr := array.NewStringBuilder(memory.DefaultAllocator)
	defer bldr.Release()

	// Build the first array; NewArray hands the buffers over and resets the builder.
	bldr.AppendValues([]string{"hello", "apache arrow"}, nil)
	arr := bldr.NewArray()
	defer arr.Release()

	// Dump the validity bitmap, then every buffer backing the array.
	bitmaps := arr.NullBitmapBytes()
	fmt.Println(hex.Dump(bitmaps))
	bufs := arr.Data().Buffers()
	for _, buf := range bufs {
		if buf != nil {
			fmt.Println(hex.Dump(buf.Buf()))
		}
	}
	fmt.Println(arr)

	// Reuse the builder to create a second, independent array.
	bldr.AppendValues([]string{"happy birthday", "leo messi"}, nil)
	arr1 := bldr.NewArray()
	defer arr1.Release()

	bitmaps1 := arr1.NullBitmapBytes()
	fmt.Println(hex.Dump(bitmaps1))
	bufs1 := arr1.Data().Buffers()
	for _, buf := range bufs1 {
		if buf != nil {
			fmt.Println(hex.Dump(buf.Buf()))
		}
	}
	fmt.Println(arr1)
}
Running the example above produces the following output:
$ go run reuse_string_builder.go
00000000 03 |.|
00000000 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000000 00 00 00 00 05 00 00 00 11 00 00 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000000 68 65 6c 6c 6f 61 70 61 63 68 65 20 61 72 72 6f |helloapache arro|
00000010 77 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |w...............|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
["hello" "apache arrow"]
00000000 03 |.|
00000000 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000000 00 00 00 00 0e 00 00 00 17 00 00 00 00 00 00 00 |................|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000000 68 61 70 70 79 20 62 69 72 74 68 64 61 79 6c 65 |happy birthdayle|
00000010 6f 20 6d 65 73 73 69 00 00 00 00 00 00 00 00 00 |o messi.........|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
["happy birthday" "leo messi"]
By now you should have a rough picture of how the Go Arrow implementation works. Next, let's look at how it manages memory with reference counting.
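In the figure above, the main interfaces of the Go Arrow implementation (Builder, Array, and ArrayData) all include Release and Retain methods, which means every type implementing these interfaces uses reference counting to track and manage memory. Retain increments the reference count by 1, and Release decrements it by 1. Both methods update the count with atomic operations, so they are safe for concurrent use. When the count drops to 0, the memory block guarded by that count can be freed.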
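The following standard-library sketch illustrates these semantics. `refBuffer`, `newRefBuffer`, `Freed` and `shareAcross` are hypothetical names, not part of the Arrow API; the sketch assumes the simple rule that whoever hands a buffer to another goroutine retains it first, and that goroutine releases it when done.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// refBuffer is a hypothetical reference-counted resource, not an Arrow type.
type refBuffer struct {
	refCount int64
	data     []byte
	freed    int32 // set to 1 once the count reaches zero
}

// newRefBuffer returns a buffer owned by the caller (count starts at 1).
func newRefBuffer(size int) *refBuffer {
	return &refBuffer{refCount: 1, data: make([]byte, size)}
}

// Retain atomically increases the reference count by 1.
func (b *refBuffer) Retain() { atomic.AddInt64(&b.refCount, 1) }

// Release atomically decreases the count and drops the data at zero.
func (b *refBuffer) Release() {
	if atomic.AddInt64(&b.refCount, -1) == 0 {
		b.data = nil
		atomic.StoreInt32(&b.freed, 1)
	}
}

// Freed reports whether the last reference has been released.
func (b *refBuffer) Freed() bool { return atomic.LoadInt32(&b.freed) == 1 }

// shareAcross retains buf once per worker; each worker releases its own
// reference concurrently when it is done reading.
func shareAcross(buf *refBuffer, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		buf.Retain() // take a reference before handing the buffer over
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer buf.Release()
			_ = len(buf.data) // read-only use of the shared data
		}()
	}
	wg.Wait()
}

func main() {
	buf := newRefBuffer(64)
	shareAcross(buf, 8)
	fmt.Println(buf.Freed()) // false: the creator still holds one reference
	buf.Release()
	fmt.Println(buf.Freed()) // true
}
```

Every Retain is paired with exactly one Release, so the count only reaches zero when the creator gives up its own reference.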
The home page of the Go Arrow implementation[3] describes the usage scenarios and rules for reference counting as follows:
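With that description in hand, we have a fair idea of when to use Retain and Release. One question remains, though: Go is a garbage-collected language, so why add reference counting on top of the GC?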
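I found the answer in this issue[4]. Replying to the issue, a committer of the Go Arrow implementation explained, in essence, that in theory, if you know you are using the default Go allocator, you do not actually have to call Retain/Release in your consumer code (that is, code that uses the Arrow Go package API) and can simply let the Go garbage collector manage everything. The library itself just needs to make sure it calls Retain/Release internally, so that consumers who use an allocator not backed by the Go GC do not end up with memory leaks.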
Here is the implementation of the default Go allocator:
package memory

// DefaultAllocator is a default implementation of Allocator and can be used anywhere
// an Allocator is required.
//
// DefaultAllocator is safe to use from multiple goroutines.
var DefaultAllocator Allocator = NewGoAllocator()

type GoAllocator struct{}

func NewGoAllocator() *GoAllocator { return &GoAllocator{} }

func (a *GoAllocator) Allocate(size int) []byte {
	buf := make([]byte, size+alignment) // padding for 64-byte alignment
	addr := int(addressOf(buf))
	next := roundUpToMultipleOf64(addr)
	if addr != next {
		shift := next - addr
		return buf[shift : size+shift : size+shift]
	}
	return buf[:size:size]
}

func (a *GoAllocator) Reallocate(size int, b []byte) []byte {
	if size == len(b) {
		return b
	}
	newBuf := a.Allocate(size)
	copy(newBuf, b)
	return newBuf
}

func (a *GoAllocator) Free(b []byte) {}
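As we can see, the default allocator simply allocates a native Go slice and makes sure the slice it returns starts at a 64-byte-aligned address. Its Free method does nothing, because the Go GC reclaims the slice.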
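The excerpt above relies on helpers (`alignment`, `addressOf`, `roundUpToMultipleOf64`) that are not shown. Here is a self-contained sketch of the same over-allocate-and-shift idea using only the standard library. `allocAligned` and `isAligned` are hypothetical names, and the sketch assumes, as the current Go runtime does, that heap objects are not moved after allocation.

```go
package main

import (
	"fmt"
	"unsafe"
)

const alignment = 64

// allocAligned returns a slice with length and capacity equal to size whose
// first byte sits at a 64-byte-aligned address. size must be >= 0.
func allocAligned(size int) []byte {
	// Over-allocate by 64 bytes so an aligned start always exists in buf.
	buf := make([]byte, size+alignment)
	addr := uintptr(unsafe.Pointer(&buf[0]))
	// Distance from addr to the next multiple of 64 (0 if already aligned).
	shift := int((alignment - addr%alignment) % alignment)
	// The three-index slice caps capacity so callers cannot grow into padding.
	return buf[shift : shift+size : shift+size]
}

// isAligned reports whether b's first byte is at a multiple of 64.
// b must not be empty.
func isAligned(b []byte) bool {
	return uintptr(unsafe.Pointer(&b[0]))%alignment == 0
}

func main() {
	for _, size := range []int{1, 10, 100, 1000} {
		b := allocAligned(size)
		fmt.Println(len(b), cap(b), isAligned(b))
	}
}
```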
So why do Retain and Release still exist, and why do they still need to be called? The committer gave several reasons, as he understands them:
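For these reasons, the Go Arrow implementation keeps Retain and Release. The usage rules above help, but these two methods still raise the bar for using the Go Arrow API to some extent. In programs that rely heavily on the Go Arrow implementation, be sure to run long-duration stability tests to verify that memory does not leak.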
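One way to picture why a missing Release matters once the allocator is not backed by the Go GC is an allocator that keeps count of outstanding bytes. The sketch below uses only the standard library; `trackingAllocator`, `buffer` and `newBuffer` are hypothetical illustration types, not the Arrow API.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// trackingAllocator is a hypothetical allocator that must be told when
// memory is no longer needed, as a C or pool allocator would be.
type trackingAllocator struct {
	inUse int64 // bytes allocated and not yet freed
}

// Allocate hands out size bytes and records them as in use.
func (a *trackingAllocator) Allocate(size int) []byte {
	atomic.AddInt64(&a.inUse, int64(size))
	return make([]byte, size)
}

// Free records that b's bytes are no longer in use.
func (a *trackingAllocator) Free(b []byte) {
	atomic.AddInt64(&a.inUse, -int64(len(b)))
}

// InUse reports the number of bytes not yet freed.
func (a *trackingAllocator) InUse() int64 { return atomic.LoadInt64(&a.inUse) }

// buffer is a hypothetical reference-counted buffer that returns its memory
// to the allocator when the last reference is released.
type buffer struct {
	refCount int64
	mem      *trackingAllocator
	data     []byte
}

func newBuffer(mem *trackingAllocator, size int) *buffer {
	return &buffer{refCount: 1, mem: mem, data: mem.Allocate(size)}
}

func (b *buffer) Retain() { atomic.AddInt64(&b.refCount, 1) }

func (b *buffer) Release() {
	if atomic.AddInt64(&b.refCount, -1) == 0 {
		b.mem.Free(b.data)
		b.data = nil
	}
}

func main() {
	mem := &trackingAllocator{}

	balanced := newBuffer(mem, 128)
	balanced.Release()
	fmt.Println(mem.InUse()) // 0: every reference was released

	leaked := newBuffer(mem, 256)
	leaked.Retain()
	leaked.Release()         // the second Release is missing
	fmt.Println(mem.InUse()) // 256: these bytes are never returned
}
```

Checking that the outstanding byte count returns to zero at the end of a test run is one simple way to catch unbalanced Retain/Release pairs.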
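Chapter 2 of "In-Memory Analytics with Apache Arrow"[5] explains how Arrow shares in-memory data with zero copies; here we call it the "slice principle". The book's example goes like this: suppose you want to run some analysis on a very large dataset with billions of rows. A common way to speed up such work is to process subsets of the rows in parallel, obtained simply by slicing the arrays and data buffers without copying the underlying data. Each batch you operate on is then not a copy but only a view of the data. The book also gives the following diagram: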
[Figure: zero-copy slices shown as views over subsets of each column's data]
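The dashed outline of each slice in the slice columns on the right indicates that it is only a view of a subset of the data in its column, and each slice can safely be processed in parallel.
Array types are logically immutable. Once the underlying data buffer has been built, it can be shared zero-copy through slicing, which avoids copying data and greatly improves the performance of data operations.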
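The same idea can be illustrated with plain Go slices, which are also views over a shared backing array. This is an analogy using only the standard library, not Arrow's own slicing API; `chunkViews` and `parallelSum` are hypothetical helper names, and `chunkSize` is assumed to be positive.

```go
package main

import (
	"fmt"
	"sync"
)

// chunkViews splits data into views of at most chunkSize elements each.
// Every view shares data's backing array; nothing is copied.
// chunkSize must be > 0.
func chunkViews(data []int64, chunkSize int) [][]int64 {
	var views [][]int64
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		views = append(views, data[start:end:end])
	}
	return views
}

// parallelSum sums each view in its own goroutine, then adds the partial
// results. The views are only read, so no locking is needed.
func parallelSum(views [][]int64) int64 {
	partial := make([]int64, len(views))
	var wg sync.WaitGroup
	for i, v := range views {
		wg.Add(1)
		go func(i int, v []int64) {
			defer wg.Done()
			for _, x := range v {
				partial[i] += x // each goroutine writes only its own slot
			}
		}(i, v)
	}
	wg.Wait()
	var total int64
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	data := make([]int64, 1000)
	for i := range data {
		data[i] = int64(i + 1)
	}
	views := chunkViews(data, 300)
	fmt.Println(len(views))                 // 4
	fmt.Println(&views[1][0] == &data[300]) // true: same memory, no copy
	fmt.Println(parallelSum(views))         // 500500
}
```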
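This article introduced the main structures of the Go Arrow implementation and its core implementation pattern, the builder pattern; explained, with reference to the official Go Arrow material, why reference counting is used for memory management and how to use it; and finally described how Arrow achieves zero-copy sharing of in-memory data. This lays a good foundation for studying Arrow's advanced data types and structures in later articles.
Note: the source code for this article can be downloaded here[6].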
[1] "A Guide to Using Apache Arrow for Go Developers: Data Types": https://tonybai.com/2023/06/25/a-guide-of-using-apache-arrow-for-gopher-part1
[2] Go is a garbage-collected language: https://tonybai.com/2023/06/13/understand-go-gc-overhead-behind-the-convenience
[3] Home page of the Go Arrow implementation: https://github.com/apache/arrow/tree/main/go
[4] This issue: https://github.com/apache/arrow/issues/35232
[5] "In-Memory Analytics with Apache Arrow": https://book.douban.com/subject/35954154/
[6] Here: https://github.com/bigwhite/experiments/blob/master/arrow/memory-management
[7] "Gopher部落" (Gopher Tribe) Knowledge Planet: https://wx.zsxq.com/dweb2/index/group/51284458844544
[8] Link: https://m.do.co/c/bff6eed92687