
Conversation

@pixelherodev
Contributor

Rationale for this change

This avoids an unnecessary heap allocation, reducing GC pressure and improving the latency and throughput of anything that processes messages.

What changes are included in this PR?

fieldMetadata now returns the flatbuf.FieldNode that it loads from metadata directly, instead of a pointer to it. The three callers that passed the pointer through also now return the FieldNode directly.
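
As a rough illustration of the shape of this change - the names below are hypothetical stand-ins, not the actual arrow-go internals:

package main

import "fmt"

// fieldNode stands in for flatbuf.FieldNode: a small, fixed-size struct.
type fieldNode struct {
	length    int64
	nullCount int64
}

// Before: returning a pointer forces the struct to escape to the heap,
// costing one allocation (and later GC work) per call.
func fieldMetadataPtr() *fieldNode {
	n := fieldNode{length: 100}
	return &n
}

// After: returning by value keeps the struct on the stack or in
// registers; no allocation at all.
func fieldMetadataVal() fieldNode {
	return fieldNode{length: 100}
}

func main() {
	fmt.Println(fieldMetadataPtr().length, fieldMetadataVal().length)
}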

Are these changes tested?

Yes.

Are there any user-facing changes?

No; the pointer was not exposed anyway.

@pixelherodev
Contributor Author

By the way, I've got at least another half dozen optimizations coming. Most are small, non-behavioral changes like this one. If it would be easier, I could send a single larger PR instead?

@zeroshade
Member

A single larger PR (within reason) would be fine.

Ideally, could you also include any relevant go test -bench runs which show the improved performance?

@pixelherodev
Contributor Author

> A single larger PR (within reason) would be fine.
>
> Ideally, could you also include any relevant go test -bench runs which show the improved performance?

Unfortunately, most of this has been measured against a larger binary - a database - where the wins show up as reductions in profile times. I can include screenshots of the flame graph portions that are relevant to Arrow, if that helps? I'm likely not authorized to share the profiles themselves, though, as they also contain proprietary symbols.

I can write up fresh benchmarks?

@pixelherodev
Contributor Author

Also, for optimizations like this one, short-lived benchmarks don't always catch the improvement: the actual change in runtime per invocation is pretty trivial. The real win is that it reduces the number of live objects on the heap, which makes every GC pass a little faster and slightly reduces how often the GC runs...

@zeroshade
Member

I'd be surprised if running the existing benchmarks with go test -bench=. -benchmem doesn't show the reduction in allocations and such at a minimum.
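
For reference, the B/op and allocs/op columns come from running benchmarks with the -benchmem flag, or from calling b.ReportAllocs() inside the benchmark itself. A minimal, hypothetical example (none of these names are from the arrow-go suite):

package demo

import "testing"

type node struct{ a, b int64 }

// noinline keeps escape analysis from eliding the allocation in this toy.
//
//go:noinline
func loadPtr() *node { n := node{1, 2}; return &n }

func BenchmarkLoadPtr(b *testing.B) {
	b.ReportAllocs() // same effect as passing -benchmem on the command line
	for i := 0; i < b.N; i++ {
		_ = loadPtr()
	}
}

Run with go test -bench=. -benchmem; the allocs/op column should report 1 for this pointer-returning version.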

@pixelherodev
Contributor Author

Hadn't realized there was an existing benchmark suite, will do!

@pixelherodev
Contributor Author

Before:

goos: linux
goarch: amd64
pkg: github.com/apache/arrow-go/v18/arrow/ipc
cpu: AMD Ryzen 9 7900X3D 12-Core Processor          
BenchmarkIPC/Writer/codec=plain-24         	 203998	     7155 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 171043	     6329 ns/op	   6131 B/op	     83 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   4206	   316075 ns/op	2338796 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  33127	    30308 ns/op	  25878 B/op	    194 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	  86971	    14408 ns/op	  13341 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	  95056	    12548 ns/op	  10285 B/op	    124 allocs/op
PASS
ok  	github.com/apache/arrow-go/v18/arrow/ipc	8.449s

After:

goos: linux
goarch: amd64
pkg: github.com/apache/arrow-go/v18/arrow/ipc
cpu: AMD Ryzen 9 7900X3D 12-Core Processor          
BenchmarkIPC/Writer/codec=plain-24         	 211380	     6277 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 172264	     6650 ns/op	   6067 B/op	     81 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   3980	   345964 ns/op	2338709 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  42043	    28065 ns/op	  25815 B/op	    192 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	 113084	    11949 ns/op	  12757 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	  94621	    12256 ns/op	  10070 B/op	    122 allocs/op
PASS
ok  	github.com/apache/arrow-go/v18/arrow/ipc	8.533s

Writer code should be unaffected; I think it's just inter-run variance (possibly running on a different CCD?).

@pixelherodev
Contributor Author

pixelherodev commented Oct 23, 2025

CI failure looks like flakiness, since it's passing on the Ubuntu image?

@pixelherodev
Contributor Author

PASS
ok  	github.com/apache/arrow-go/v18/arrow/flight	1.046s

That test is passing locally.

@pixelherodev
Contributor Author

With that last commit as well:

BenchmarkIPC/Writer/codec=plain-24         	 223406	     5536 ns/op	   8160 B/op	    91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 236199	     4820 ns/op	   6065 B/op	     78 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   3708	   325071 ns/op	2338712 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  44013	    25784 ns/op	  25808 B/op	    189 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	 101749	    11613 ns/op	  12890 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	 129542	    10651 ns/op	  10081 B/op	    119 allocs/op

Noam Preil added 16 commits October 23, 2025 16:59

  • This avoids an unnecessary heap allocation, reducing GC pressure and improving the latency and throughput of anything that processes messages.
  • This is a drive-by patch; the test was failing for the other optimization, and I noticed this when I went to take a look and figured I'd just send the patch.
  • The Table() method is unused, so it's fine to embed the table as is.
  • Drops another allocation per ipc/Message, improves throughput further.
  • This is a bit bloated, but I wasn't sure if the API for NewMessageReader and NewReaderFromMessageReader could be changed to take in the config struct directly. It would be a lot cleaner to expose both APIs with Config directly... This drops another allocation from every call to NewReader, which - if a new reader must be created for every API call - can be noticeable...
  • In branch 1, shift is defined as next-addr. We then slice buf with shift as the offset, and with size+shift as length and capacity. In branch 2, addr == next, and thus shift == 0. size+shift == size. The slice statements are thus equivalent; there's no reason to branch here (see the sketch after this list).
  • Before: cannotInlineFunction: complexity 148 exceeds budget. After: canInlineFunction (cost: 68).
  • Removes four allocations from every NewReader call.
  • The meta object in messageReader is a wrapper around flatbuf.Message, which is static. There's no allocator usage. As such, we can reuse the memory.Buffer heap object to at least avoid re-allocating it on each call, since it never really changes.
  • Ideally, we'd reuse the underlying allocation instead of freeing it and reallocating on each call, but that seems to be exposing a bug in the buffering logic. Still, this is a few more allocations gone...
  • Avoid branching in Message()
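
A quick sketch of the slice-equivalence argument from the commit message above (buf, addr, next, and size are hypothetical stand-ins for the real variables):

package main

import "fmt"

func main() {
	buf := make([]byte, 64)
	addr, next, size := 8, 8, 16 // the addr == next case

	// Branch 1 (general case): slice with shift as the offset, and
	// size+shift as both length and capacity.
	shift := next - addr
	a := buf[shift : size+shift : size+shift]

	// Branch 2 (addr == next): shift is 0, so branch 1 reduces to
	// buf[0:size:size] - exactly this slice. The branch is redundant.
	b := buf[0:size:size]

	fmt.Println(len(a) == len(b), cap(a) == cap(b)) // true true
}

The inlining figures in a later message ("complexity 148 exceeds budget" vs. "cost: 68") are the kind of diagnostics the Go compiler reports when built with -gcflags=-m.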
@pixelherodev
Contributor Author

....hmmm, I'd assumed that, since the Message is invalidated on each call to Next(), nothing would try to hold on to the underlying buffers. Some of the failures suggest that may be false?

@pixelherodev
Contributor Author

// Code generated by the FlatBuffers compiler. DO NOT EDIT.

.... So I only just noticed this message. Sigh.

I'll have to send a patch to the generation even further upstream, I'm guessing? 😓

@zeroshade
Member

> I'll have to send a patch to the generation even further upstream, I'm guessing? 😓

yup. Did you re-generate the flatbuffer code for this?

@pixelherodev
Contributor Author

> I'll have to send a patch to the generation even further upstream, I'm guessing? 😓
>
> yup. Did you re-generate the flatbuffer code for this?

I tested it with a patch to flatc - the problem is that some of these changes work by changing the interfaces that are generated, which is a breaking change. I'm not sure they'll want those changes, and they don't seem to look at patches much anyway :/

diff --git a/src/idl_gen_go.cpp b/src/idl_gen_go.cpp
index f4a03cec..0c2b14fc 100644
--- a/src/idl_gen_go.cpp
+++ b/src/idl_gen_go.cpp
@@ -186,7 +186,7 @@ class GoGenerator : public BaseGenerator {
   // Most field accessors need to retrieve and test the field offset first,
   // this is the prefix code for that.
   std::string OffsetPrefix(const FieldDef& field) {
-    return "{\n\to := flatbuffers.UOffsetT(rcv._tab.Offset(" +
+    return "{\n\to := flatbuffers.UOffsetT(rcv.Offset(" +
            NumToString(field.value.offset) + "))\n\tif o != 0 {\n";
   }
 
@@ -198,7 +198,7 @@ class GoGenerator : public BaseGenerator {
 
     // _ is reserved in flatbuffers field names, so no chance of name
     // conflict:
-    code += "_tab ";
+    code += "";
     code += struct_def.fixed ? "flatbuffers.Struct" : "flatbuffers.Table";
     code += "\n}\n\n";
   }
@@ -323,9 +323,9 @@ class GoGenerator : public BaseGenerator {
 
     for (int i = 0; i < 2; i++) {
       code += "func Get" + size_prefix[i] + "RootAs" + struct_type;
-      code += "(buf []byte, offset flatbuffers.UOffsetT) ";
-      code += "*" + struct_type + "";
-      code += " {\n";
+      code += "(buf []byte, offset flatbuffers.UOffsetT) (x ";
+      code += struct_type + "";
+      code += ") {\n";
       if (i == 0) {
         code += "\tn := flatbuffers.GetUOffsetT(buf[offset:])\n";
       } else {
@@ -333,11 +333,10 @@ class GoGenerator : public BaseGenerator {
             "\tn := "
             "flatbuffers.GetUOffsetT(buf[offset+flatbuffers.SizeUint32:])\n";
       }
-      code += "\tx := &" + struct_type + "{}\n";
       if (i == 0) {
-        code += "\tx.Init(buf, n+offset)\n";
+        code += "\tx.Table = flatbuffers.Table{Bytes: buf, Pos: n+offset}\n";
       } else {
-        code += "\tx.Init(buf, n+offset+flatbuffers.SizeUint32)\n";
+        code += "\tx.Table = flatbuffers.Table{Bytes: buf, Pos: n+offset+flatbuffers.SizeUint32}\n";
       }
       code += "\treturn x\n";
       code += "}\n\n";
@@ -371,24 +370,8 @@ class GoGenerator : public BaseGenerator {
     GenReceiver(struct_def, code_ptr);
     code += " Init(buf []byte, i flatbuffers.UOffsetT) ";
     code += "{\n";
-    code += "\trcv._tab.Bytes = buf\n";
-    code += "\trcv._tab.Pos = i\n";
-    code += "}\n\n";
-  }
-
-  // Implement the table accessor
-  void GenTableAccessor(const StructDef& struct_def, std::string* code_ptr) {
-    std::string& code = *code_ptr;
-
-    GenReceiver(struct_def, code_ptr);
-    code += " Table() flatbuffers.Table ";
-    code += "{\n";
-
-    if (struct_def.fixed) {
-      code += "\treturn rcv._tab.Table\n";
-    } else {
-      code += "\treturn rcv._tab\n";
-    }
+    code += "\trcv.Bytes = buf\n";
+    code += "\trcv.Pos = i\n";
     code += "}\n\n";
   }
 
@@ -400,7 +383,7 @@ class GoGenerator : public BaseGenerator {
     GenReceiver(struct_def, code_ptr);
     code += " " + namer_.Function(field) + "Length(";
     code += ") int " + OffsetPrefix(field);
-    code += "\t\treturn rcv._tab.VectorLen(o)\n\t}\n";
+    code += "\t\treturn rcv.VectorLen(o)\n\t}\n";
     code += "\treturn 0\n}\n\n";
   }
 
@@ -412,7 +395,7 @@ class GoGenerator : public BaseGenerator {
     GenReceiver(struct_def, code_ptr);
     code += " " + namer_.Function(field) + "Bytes(";
     code += ") []byte " + OffsetPrefix(field);
-    code += "\t\treturn rcv._tab.ByteVector(o + rcv._tab.Pos)\n\t}\n";
+    code += "\t\treturn rcv.ByteVector(o + rcv.Pos)\n\t}\n";
     code += "\treturn nil\n}\n\n";
   }
 
@@ -426,7 +409,7 @@ class GoGenerator : public BaseGenerator {
     code += "() " + TypeName(field) + " {\n";
     code += "\treturn " +
             CastToEnum(field.value.type,
-                       getter + "(rcv._tab.Pos + flatbuffers.UOffsetT(" +
+                       getter + "(rcv.Pos + flatbuffers.UOffsetT(" +
                            NumToString(field.value.offset) + "))");
     code += "\n}\n";
   }
@@ -445,7 +428,7 @@ class GoGenerator : public BaseGenerator {
     } else {
       code += "\t\treturn ";
     }
-    code += CastToEnum(field.value.type, getter + "(o + rcv._tab.Pos)");
+    code += CastToEnum(field.value.type, getter + "(o + rcv.Pos)");
     if (field.IsScalarOptional()) {
       code += "\n\t\treturn &v";
     }
@@ -467,7 +450,7 @@ class GoGenerator : public BaseGenerator {
     code += "\tif obj == nil {\n";
     code += "\t\tobj = new(" + TypeName(field) + ")\n";
     code += "\t}\n";
-    code += "\tobj.Init(rcv._tab.Bytes, rcv._tab.Pos+";
+    code += "\tobj.Init(rcv.Bytes, rcv.Pos+";
     code += NumToString(field.value.offset) + ")";
     code += "\n\treturn obj\n";
     code += "}\n";
@@ -480,19 +463,14 @@ class GoGenerator : public BaseGenerator {
     std::string& code = *code_ptr;
     GenReceiver(struct_def, code_ptr);
     code += " " + namer_.Function(field);
-    code += "(obj *";
-    code += TypeName(field);
-    code += ") *" + TypeName(field) + " " + OffsetPrefix(field);
+    code += "() (obj " + TypeName(field) + ", ok bool) " + OffsetPrefix(field);
     if (field.value.type.struct_def->fixed) {
-      code += "\t\tx := o + rcv._tab.Pos\n";
+      code += "\t\tx := o + rcv.Pos\n";
     } else {
-      code += "\t\tx := rcv._tab.Indirect(o + rcv._tab.Pos)\n";
+      code += "\t\tx := rcv.Indirect(o + rcv.Pos)\n";
     }
-    code += "\t\tif obj == nil {\n";
-    code += "\t\t\tobj = new(" + TypeName(field) + ")\n";
-    code += "\t\t}\n";
-    code += "\t\tobj.Init(rcv._tab.Bytes, x)\n";
-    code += "\t\treturn obj\n\t}\n\treturn nil\n";
+    code += "\t\tobj.Init(rcv.Bytes, x)\n\t\tok = true\n";
+    code += "\t}\n\treturn\n";
     code += "}\n\n";
   }
 
@@ -504,7 +482,7 @@ class GoGenerator : public BaseGenerator {
     code += " " + namer_.Function(field);
     code += "() " + TypeName(field) + " ";
     code += OffsetPrefix(field) + "\t\treturn " + GenGetter(field.value.type);
-    code += "(o + rcv._tab.Pos)\n\t}\n\treturn nil\n";
+    code += "(o + rcv.Pos)\n\t}\n\treturn nil\n";
     code += "}\n\n";
   }
 
@@ -532,13 +510,13 @@ class GoGenerator : public BaseGenerator {
     code += " " + namer_.Function(field);
     code += "(obj *" + TypeName(field);
     code += ", j int) bool " + OffsetPrefix(field);
-    code += "\t\tx := rcv._tab.Vector(o)\n";
+    code += "\t\tx := rcv.Vector(o)\n";
     code += "\t\tx += flatbuffers.UOffsetT(j) * ";
     code += NumToString(InlineSize(vectortype)) + "\n";
     if (!(vectortype.struct_def->fixed)) {
-      code += "\t\tx = rcv._tab.Indirect(x)\n";
+      code += "\t\tx = rcv.Indirect(x)\n";
     }
-    code += "\t\tobj.Init(rcv._tab.Bytes, x)\n";
+    code += "\t\tobj.Init(rcv.Bytes, x)\n";
     code += "\t\treturn true\n\t}\n";
     code += "\treturn false\n";
     code += "}\n\n";
@@ -566,9 +544,9 @@ class GoGenerator : public BaseGenerator {
     code += "(obj *" + TypeName(field);
     code += ", key " + NativeType(key_field.value.type) + ") bool " +
             OffsetPrefix(field);
-    code += "\t\tx := rcv._tab.Vector(o)\n";
+    code += "\t\tx := rcv.Vector(o)\n";
     code += "\t\treturn ";
-    code += "obj.LookupByKey(key, x, rcv._tab.Bytes)\n";
+    code += "obj.LookupByKey(key, x, rcv.Bytes)\n";
     code += "\t}\n";
     code += "\treturn false\n";
     code += "}\n\n";
@@ -585,7 +563,7 @@ class GoGenerator : public BaseGenerator {
     code += " " + namer_.Function(field);
     code += "(j int) " + TypeName(field) + " ";
     code += OffsetPrefix(field);
-    code += "\t\ta := rcv._tab.Vector(o)\n";
+    code += "\t\ta := rcv.Vector(o)\n";
     code += "\t\treturn " +
             CastToEnum(field.value.type,
                        GenGetter(field.value.type) +
@@ -806,12 +784,12 @@ class GoGenerator : public BaseGenerator {
                                  const FieldDef& field, std::string* code_ptr) {
     std::string& code = *code_ptr;
     std::string setter =
-        "rcv._tab.Mutate" + namer_.Method(GenTypeBasic(field.value.type));
+        "rcv.Mutate" + namer_.Method(GenTypeBasic(field.value.type));
     GenReceiver(struct_def, code_ptr);
     code += " Mutate" + namer_.Function(field);
     code +=
         "(n " + GenTypeGet(field.value.type) + ") bool {\n\treturn " + setter;
-    code += "(rcv._tab.Pos+flatbuffers.UOffsetT(";
+    code += "(rcv.Pos+flatbuffers.UOffsetT(";
     code += NumToString(field.value.offset) + "), ";
     code += CastToBaseType(field.value.type, "n") + ")\n}\n\n";
   }
@@ -820,7 +798,7 @@ class GoGenerator : public BaseGenerator {
   void MutateScalarFieldOfTable(const StructDef& struct_def,
                                 const FieldDef& field, std::string* code_ptr) {
     std::string& code = *code_ptr;
-    std::string setter = "rcv._tab.Mutate" +
+    std::string setter = "rcv.Mutate" +
                          namer_.Method(GenTypeBasic(field.value.type)) + "Slot";
     GenReceiver(struct_def, code_ptr);
     code += " Mutate" + namer_.Function(field);
@@ -837,12 +815,12 @@ class GoGenerator : public BaseGenerator {
     std::string& code = *code_ptr;
     auto vectortype = field.value.type.VectorType();
     std::string setter =
-        "rcv._tab.Mutate" + namer_.Method(GenTypeBasic(vectortype));
+        "rcv.Mutate" + namer_.Method(GenTypeBasic(vectortype));
     GenReceiver(struct_def, code_ptr);
     code += " Mutate" + namer_.Function(field);
     code += "(j int, n " + TypeName(field) + ") bool ";
     code += OffsetPrefix(field);
-    code += "\t\ta := rcv._tab.Vector(o)\n";
+    code += "\t\ta := rcv.Vector(o)\n";
     code += "\t\treturn " + setter + "(";
     code += "a+flatbuffers.UOffsetT(j*";
     code += NumToString(InlineSize(vectortype)) + "), ";
@@ -907,8 +885,6 @@ class GoGenerator : public BaseGenerator {
     // Generate the Init method that sets the field in a pre-existing
     // accessor object. This is to allow object reuse.
     InitializeExisting(struct_def, code_ptr);
-    // Generate _tab accessor
-    GenTableAccessor(struct_def, code_ptr);
 
     // Generate struct fields accessors
     for (auto it = struct_def.fields.vec.begin();
@@ -1282,13 +1258,13 @@ class GoGenerator : public BaseGenerator {
         code +=
             "\tt." + field_field + " = rcv." + field_field + "(nil).UnPack()\n";
       } else if (field.value.type.base_type == BASE_TYPE_UNION) {
-        const std::string field_table = field_var + "Table";
-        code += "\t" + field_table + " := flatbuffers.Table{}\n";
+        const std::string fielde = field_var + "Table";
+        code += "\t" + fielde + " := flatbuffers.Table{}\n";
         code +=
-            "\tif rcv." + namer_.Method(field) + "(&" + field_table + ") {\n";
+            "\tif rcv." + namer_.Method(field) + "(&" + fielde + ") {\n";
         code += "\t\tt." + field_field + " = rcv." +
                 namer_.Method(field.name + UnionTypeFieldSuffix()) +
-                "().UnPack(" + field_table + ")\n";
+                "().UnPack(" + fielde + ")\n";
         code += "\t}\n";
       } else {
         FLATBUFFERS_ASSERT(0);
@@ -1399,13 +1375,13 @@ class GoGenerator : public BaseGenerator {
   std::string GenGetter(const Type& type) {
     switch (type.base_type) {
       case BASE_TYPE_STRING:
-        return "rcv._tab.ByteVector";
+        return "rcv.ByteVector";
       case BASE_TYPE_UNION:
-        return "rcv._tab.Union";
+        return "rcv.Union";
       case BASE_TYPE_VECTOR:
         return GenGetter(type.VectorType());
       default:
-        return "rcv._tab.Get" + namer_.Function(GenTypeBasic(type));
+        return "rcv.Get" + namer_.Function(GenTypeBasic(type));
     }
   }
 

Regenerated the files with that patch to flatc and updated the usage slightly to conform, and that's another alloc per NewReader and another alloc per Message call gone...
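
The net effect of that flatc patch on the generated Go code is roughly the following - a simplified, hypothetical before/after; the real generated types carry many more accessors:

package flatbuf

import flatbuffers "github.com/google/flatbuffers/go"

// Before: the table lives in a named _tab field, and root accessors
// return *FieldNodeOld, forcing a heap allocation per lookup.
type FieldNodeOld struct {
	_tab flatbuffers.Struct
}

func (rcv *FieldNodeOld) Length() int64 {
	return rcv._tab.GetInt64(rcv._tab.Pos + flatbuffers.UOffsetT(0))
}

// After: the table is embedded, its methods and fields are promoted, and
// the type can be returned and passed by value with no allocation.
type FieldNode struct {
	flatbuffers.Struct
}

func (rcv *FieldNode) Length() int64 {
	return rcv.GetInt64(rcv.Pos + flatbuffers.UOffsetT(0))
}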

@pixelherodev
Contributor Author

...ah. The regenerated files are missing the license headers.

@pixelherodev
Contributor Author

Before:

BenchmarkIPC/Writer/codec=plain-24         	 203998	     7155 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 171043	     6329 ns/op	   6131 B/op	     83 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   4206	   316075 ns/op	2338796 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  33127	    30308 ns/op	  25878 B/op	    194 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	  86971	    14408 ns/op	  13341 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	  95056	    12548 ns/op	  10285 B/op	    124 allocs/op

Current:

BenchmarkIPC/Writer/codec=plain-24         	 200580	     5762 ns/op	   8160 B/op	     91 allocs/op
BenchmarkIPC/Reader/codec=plain-24         	 274100	     4364 ns/op	   5442 B/op	     62 allocs/op
BenchmarkIPC/Writer/codec=zstd-24          	   2731	   369465 ns/op	2338754 B/op	    150 allocs/op
BenchmarkIPC/Reader/codec=zstd-24          	  47542	    25236 ns/op	  25122 B/op	    171 allocs/op
BenchmarkIPC/Writer/codec=lz4-24           	 103572	    10842 ns/op	  12689 B/op	    134 allocs/op
BenchmarkIPC/Reader/codec=lz4-24           	 137494	     8809 ns/op	   9299 B/op	    101 allocs/op

@pixelherodev
Contributor Author

I also haven't been able to look into the arrow/flight failure, as that test has never passed for me locally anyway...

@pixelherodev
Contributor Author

I have more optimizations locally, but unfortunately most of them deliberately break functionality (e.g. arrow/flight). Going to take a look at profiles of the benchmarks and try to improve this as much as I can without those :/

@pixelherodev
Contributor Author

pixelherodev commented Oct 27, 2025

I think I'm going to split this into a few small PRs, if that makes sense:

  • Regenerate the flatbuf package without any changes from me (it's out of sync with the arrow/ repo)
    • As a follow-up to that, regenerate it with patches to remove unnecessary allocations, document in the headers that the generated code has been modified to remove allocations, and make the changes to e.g. initFB and our usage of it to avoid heap escapes
  • Separately, the internal changes to the IPC code that just clean up buffer management (preallocation / reuse)
  • Separately, the miscellaneous small changes that don't fit elsewhere

That okay?

@zeroshade
Member

Splitting it up sounds good

