
Multiaddr: Replace Arc<Vec<u8>> with inline storage #1510


Closed
rklaehn wants to merge 7 commits into master from rkl/multiaddr-inline-storage

Conversation

rklaehn
Contributor

@rklaehn rklaehn commented Mar 24, 2020

Storage keeps small byte sequences inline and spills to the heap only once that is no longer possible.

A typical small multiaddr, e.g. an ip4 or ip6 host/port pair, will be stored inline. Only more complex multiaddrs, such as those containing multiple peer ids, will be stored on the heap as an Arc<[u8]>.

The push and with fns allocate temporary vectors, but I think this is an acceptable tradeoff for making the in-memory representation of multiaddrs much smaller. Many of the multiaddrs you hold on to are never built manually but come from somewhere else.
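For illustration, a minimal sketch of the storage scheme described above. The MAX_INLINE constant and the Storage::Inline variant do appear in the diff under review; the exact field layout and helper methods here are illustrative, not copied from the PR.

use std::sync::Arc;

/// Maximum number of bytes stored inline (the diff uses 30).
const MAX_INLINE: usize = 30;

#[derive(Clone)]
enum Storage {
    /// Small payloads live directly in the enum; `len` records how many of
    /// the `data` bytes are in use.
    Inline { len: u8, data: [u8; MAX_INLINE] },
    /// Larger payloads spill into a single reference-counted heap allocation.
    Heap(Arc<[u8]>),
}

impl Storage {
    fn from_slice(bytes: &[u8]) -> Self {
        if bytes.len() <= MAX_INLINE {
            let mut data = [0u8; MAX_INLINE];
            data[..bytes.len()].copy_from_slice(bytes);
            Storage::Inline { len: bytes.len() as u8, data }
        } else {
            Storage::Heap(bytes.into())
        }
    }

    fn as_slice(&self) -> &[u8] {
        match self {
            Storage::Inline { len, data } => &data[..*len as usize],
            Storage::Heap(arc) => &arc[..],
        }
    }
}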

@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 3dc7801 to c84042f Compare March 24, 2020 20:42
@rklaehn rklaehn changed the title Replace Arc<Vec<u8>> with inline storage Multiaddr: Replace Arc<Vec<u8>> with inline storage Mar 25, 2020
@rklaehn
Contributor Author

rklaehn commented Mar 25, 2020

Note: this is basically identical to the optimization that has been done in rust-multihash in multiformats/rust-multihash#47

Member

@mxinden mxinden left a comment

For what my review is worth, this looks good to me.

I would prefer every performance optimization to come with a benchmark that prevents future us from introducing performance regressions. Something along the lines of https://gist.github.com/vmx/0dc0d981b522e2834084559faa0dcb56. What do you all think?

@rklaehn
Contributor Author

rklaehn commented Mar 26, 2020

@mxinden I can add a clone benchmark that compares cloning multiaddrs slightly below the limit with multiaddrs slightly above the limit. But it will be hard to integrate this into CI, so it would just be something you run on demand when you want to touch this part of the code...

@rklaehn
Contributor Author

rklaehn commented Mar 26, 2020

Added a small bench and a test that shows where the limits of inline storage are.

Here are the results:

$ cargo bench --bench clone
clone 10 max inlined   time:   [45.106 ns 45.587 ns 46.119 ns]                                   
clone 10 min heap      time:   [135.29 ns 135.86 ns 136.50 ns]                               

Note that the main point of the optimization is to avoid touching the heap and to reduce total memory usage, but a factor-of-3 speedup in cloning is a nice benefit. Since cloning just increases the strong_count of the Arc, it is independent of size in the heap case.
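For reference, a sketch of how such a clone benchmark could look with criterion. The group names match the output above, the two addresses are the inline/heap examples used in the heaptrack experiment further down, and the "10" is assumed to mean ten clones per iteration; the actual bench in the PR may differ.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use multiaddr::Multiaddr;

fn clone_bench(c: &mut Criterion) {
    // Encoded size right at the inline limit.
    let inlined: Multiaddr = "/dns4/01234567890123456789123/tcp/80/ws".parse().unwrap();
    // A slightly longer name: the encoding no longer fits inline and spills to the heap.
    let heap: Multiaddr = "/dns4/0123456789012345678901234/tcp/80/ws".parse().unwrap();

    c.bench_function("clone 10 max inlined", |b| {
        b.iter(|| {
            for _ in 0..10 {
                black_box(inlined.clone());
            }
        })
    });
    c.bench_function("clone 10 min heap", |b| {
        b.iter(|| {
            for _ in 0..10 {
                black_box(heap.clone());
            }
        })
    });
}

criterion_group!(benches, clone_bench);
criterion_main!(benches);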

@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 73cad48 to cdbf554 Compare March 26, 2020 09:54
/// a multiaddr containing an ipv4 or ipv6 address and port.
///
/// More complex multiaddrs like those containing peer ids will be stored on the heap.
const MAX_INLINE: usize = 30;
Contributor

Why not 62? Most peer IDs would fit in there as well and peer IDs are quite a common case.

Contributor Author

@rklaehn rklaehn Mar 26, 2020

We could make it 62. I have no idea about the size distribution of the typical multiaddr. From what I have seen in rust-libp2p, and for what I intend to use this for, you often retain addresses for other nodes, which are just host/port/protocol. If you say that the ones containing a peer id are sufficiently common, I will change this to 62.

Contributor Author

Unless large multiaddrs are way more popular than I think, 64 would be too big. Falling back to the Arc'd version for large multiaddrs is not the end of the world, and there are situations where having lots of 64-byte objects would be worse than having a bunch of pointers to Arcs.
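To make the size tradeoff concrete, a small standalone check one could run; the printed values are what a typical 64-bit target gives today, not guarantees.

fn main() {
    // A Vec<u8> is pointer + length + capacity.
    println!("Vec<u8>:      {} bytes", std::mem::size_of::<Vec<u8>>());                 // typically 24
    // Arc<Vec<u8>> is a thin pointer to an Arc allocation that holds the Vec header.
    println!("Arc<Vec<u8>>: {} bytes", std::mem::size_of::<std::sync::Arc<Vec<u8>>>()); // typically 8
    // Arc<[u8]> is a fat pointer (pointer + length).
    println!("Arc<[u8]>:    {} bytes", std::mem::size_of::<std::sync::Arc<[u8]>>());    // typically 16
    // With MAX_INLINE = 62, every Multiaddr would carry a buffer of this size
    // (plus length and discriminant), i.e. roughly 64 bytes per instance.
    println!("[u8; 62]:     {} bytes", std::mem::size_of::<[u8; 62]>());                // 62
}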

Some(protocol)
}

/// Like [`Multiaddr::push`] but consumes `self`.
pub fn with(mut self, p: Protocol<'_>) -> Self {
let mut w = io::Cursor::<&mut Vec<u8>>::new(Arc::make_mut(&mut self.bytes));
let mut w = io::Cursor::new(self.to_vec());
Contributor

Can we not avoid this allocation if p fits into Storage::Inline?

Contributor Author

Same as above.

@twittner
Contributor

Use smallvec to avoid some allocations in push and with

Sorry, what I had in mind was to use SmallVec everywhere instead of Storage. From my perspective all that is needed compared to master is this change:

diff --git a/misc/multiaddr/Cargo.toml b/misc/multiaddr/Cargo.toml
index c30b4504..a6ce960d 100644
--- a/misc/multiaddr/Cargo.toml
+++ b/misc/multiaddr/Cargo.toml
@@ -16,6 +16,7 @@ data-encoding = "2.1"
 multihash = "0.10"
 percent-encoding = "2.1.0"
 serde = "1.0.70"
+smallvec = { version = "1.0", features = ["write"] }
 static_assertions = "1.1"
 unsigned-varint = "0.3"
 url = { version = "2.1.0", default-features = false }
diff --git a/misc/multiaddr/src/lib.rs b/misc/multiaddr/src/lib.rs
index f81b8ada..58906a48 100644
--- a/misc/multiaddr/src/lib.rs
+++ b/misc/multiaddr/src/lib.rs
@@ -14,10 +14,10 @@ use serde::{
     Serializer,
     de::{self, Error as DeserializerError}
 };
+use smallvec::SmallVec;
 use std::{
     convert::TryFrom,
     fmt,
-    io,
     iter::FromIterator,
     net::{IpAddr, Ipv4Addr, Ipv6Addr},
     result::Result as StdResult,
@@ -35,19 +35,22 @@ static_assertions::const_assert! {
     std::mem::size_of::<usize>() <= std::mem::size_of::<u64>()
 }
 
+/// Size to use with `SmallVec`.
+const INLINE_SIZE: usize = 32;
+
 /// Representation of a Multiaddr.
 #[derive(PartialEq, Eq, Clone, Hash)]
-pub struct Multiaddr { bytes: Arc<Vec<u8>> }
+pub struct Multiaddr { bytes: Arc<SmallVec<[u8; INLINE_SIZE]>> }
 
 impl Multiaddr {
     /// Create a new, empty multiaddress.
     pub fn empty() -> Self {
-        Self { bytes: Arc::new(Vec::new()) }
+        Self { bytes: Arc::new(SmallVec::new()) }
     }
 
     /// Create a new, empty multiaddress with the given capacity.
     pub fn with_capacity(n: usize) -> Self {
-        Self { bytes: Arc::new(Vec::with_capacity(n)) }
+        Self { bytes: Arc::new(SmallVec::with_capacity(n)) }
     }
 
     /// Return the length in bytes of this multiaddress.
@@ -73,9 +76,8 @@ impl Multiaddr {
     /// ```
     ///
     pub fn push(&mut self, p: Protocol<'_>) {
-        let mut w = io::Cursor::<&mut Vec<u8>>::new(Arc::make_mut(&mut self.bytes));
-        w.set_position(w.get_ref().len() as u64);
-        p.write_bytes(&mut w).expect("Writing to a `io::Cursor<&mut Vec<u8>>` never fails.")
+        let mut w = Arc::make_mut(&mut self.bytes);
+        p.write_bytes(&mut w).expect("Writing to a `SmallVec` never fails.")
     }
 
     /// Pops the last `Protocol` of this multiaddr, or `None` if the multiaddr is empty.
@@ -107,9 +109,8 @@ impl Multiaddr {
 
     /// Like [`Multiaddr::push`] but consumes `self`.
     pub fn with(mut self, p: Protocol<'_>) -> Self {
-        let mut w = io::Cursor::<&mut Vec<u8>>::new(Arc::make_mut(&mut self.bytes));
-        w.set_position(w.get_ref().len() as u64);
-        p.write_bytes(&mut w).expect("Writing to a `io::Cursor<&mut Vec<u8>>` never fails.");
+        let mut w = Arc::make_mut(&mut self.bytes);
+        p.write_bytes(&mut w).expect("Writing to a `SmallVec` never fails.");
         self
     }
 
@@ -212,7 +213,7 @@ impl<'a> FromIterator<Protocol<'a>> for Multiaddr {
     where
         T: IntoIterator<Item = Protocol<'a>>,
     {
-        let mut writer = Vec::new();
+        let mut writer = SmallVec::new();
         for cmp in iter {
             cmp.write_bytes(&mut writer).expect("Writing to a `Vec` never fails.");
         }
@@ -224,7 +225,7 @@ impl FromStr for Multiaddr {
     type Err = Error;
 
     fn from_str(input: &str) -> Result<Self> {
-        let mut writer = Vec::new();
+        let mut writer = SmallVec::new();
         let mut parts = input.split('/').peekable();
 
         if Some("") != parts.next() {
@@ -262,7 +263,7 @@ impl<'a> Iterator for Iter<'a> {
 
 impl<'a> From<Protocol<'a>> for Multiaddr {
     fn from(p: Protocol<'a>) -> Multiaddr {
-        let mut w = Vec::new();
+        let mut w = SmallVec::new();
         p.write_bytes(&mut w).expect("Writing to a `Vec` never fails.");
         Multiaddr { bytes: Arc::new(w) }
     }
@@ -299,7 +300,7 @@ impl TryFrom<Vec<u8>> for Multiaddr {
             let (_, s) = Protocol::from_bytes(slice)?;
             slice = s
         }
-        Ok(Multiaddr { bytes: Arc::new(v) })
+        Ok(Multiaddr { bytes: Arc::new(SmallVec::from(v)) })
     }
 }

@twittner
Contributor

Ah sorry, that makes no sense. We still do the Arc.
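A short illustration of the point being conceded here, assuming smallvec 1.x: the inline buffer only helps while the SmallVec itself lives on the stack; putting it behind an Arc moves the whole thing to the heap again.

use smallvec::SmallVec;
use std::sync::Arc;

fn main() {
    // The inline buffer avoids a heap allocation on its own...
    let small: SmallVec<[u8; 32]> = SmallVec::from_slice(&[1, 2, 3]);
    assert!(!small.spilled()); // still inline, nothing on the heap yet

    // ...but Arc::new heap-allocates the whole SmallVec, inline buffer included,
    // so the allocation is saved neither for small nor for large multiaddrs.
    let _arced: Arc<SmallVec<[u8; 32]>> = Arc::new(small);
}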

}

#[test]
fn roundtrip() {
Contributor

Is this test not redundant given the property test below?

Contributor Author

Well, this one is guaranteed to hit all the interesting cases, whereas with the property-based test you depend on luck. But this was copied from multihash, which is an independent repository. So if you want to reduce the number of tests in rust-libp2p, which is a bigger project, I can remove it.

Contributor Author

If you are concerned about test runtime, it would probably be best to keep this one and drop the property-based test.
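For reference, a sketch of what such an exhaustive roundtrip test could look like, walking the encoded size across the inline limit; the addresses and test name are illustrative, not the actual test from the PR.

use multiaddr::Multiaddr;
use std::convert::TryFrom;

#[test]
fn roundtrip_across_inline_limit() {
    // dns4 names of increasing length move the encoded size from well below
    // MAX_INLINE to well above it, so both storage variants are exercised.
    for n in 1..64 {
        let text = format!("/dns4/{}/tcp/80", "x".repeat(n));
        let addr: Multiaddr = text.parse().unwrap();
        // bytes -> Multiaddr must reproduce the same value...
        let back = Multiaddr::try_from(addr.to_vec()).unwrap();
        assert_eq!(addr, back);
        // ...and so must the textual form.
        assert_eq!(addr.to_string(), text);
    }
}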

@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 48def4c to 7e96076 Compare March 26, 2020 17:15
@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 7e96076 to 0c0485f Compare March 26, 2020 17:44
@tomaka
Member

tomaka commented Mar 30, 2020

This comment is kind of irrelevant here since we're also using Bytes, but I'm really not a fan of this kind of magic and tend to prefer explicitness.

To me, instead of optimizing the insides of the Multiaddr, we should either avoid cloning multiaddresses or use an Arc<Multiaddr>.

I also strongly think that we're micro-optimizing here. Benchmarking how much time it takes to clone a Multiaddr is based on the assumption that we're spending a lot of time cloning multiaddresses, while my guess is that this accounts for not even 0.1% of the CPU time of a typical libp2p application.

That being said, I don't care that much and I'm not opposed to merging this.

@rklaehn
Contributor Author

rklaehn commented Mar 30, 2020

Just to be clear, the purpose of this change is not to make clone faster. That is just a welcome but not very important side effect. I am perfectly happy with the speed of cloning an Arc.

The purpose of the optimization is to reduce the memory usage of a multiaddr, to make memory usage more deterministic, and to avoid touching the heap at all in the average case. (The worst-case memory usage of an Arc<Vec<u8>> can be arbitrarily large, and even in the average case it will be significantly more than just the raw bytes, because it is two objects on the heap.)

I agree that this is an optimization that does not matter at all unless you are running on very low-powered hardware and/or have a very large number of multiaddrs sticking around. But if multiaddr is supposed to be a general-purpose crate and not just a support library for Parity, then these situations can happen. E.g. some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.

Besides, as far as micro-optimizations go, I think this one is pretty harmless and self-contained. It does not add unsafe code or additional dependencies (except for smallvec, which I could also remove again). The optimization was originally done for multihash, for which it definitely matters because an IPFS node will keep a very large number of multihashes around. I thought it would be nice to carry it over to multiaddr, where admittedly it matters a bit less.

@tomaka
Member

tomaka commented Mar 30, 2020

But if multiaddr is supposed to be a general-purpose crate and not just a support library for Parity, then these situations can happen. E.g. some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.

If anything, this change ties the multiaddr more to a specific usage.
If you are storing a lot of Multiaddrs that are larger than 30 bytes, then this PR is strictly negative, as every single instance of Multiaddr is now at least 30 bytes in size.
Anything other than Multiaddr { _: Vec<u8> } is opinionated about how it is being used.

some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.
for which it definitely matters because an IPFS node will keep a very large number of multihashes around

I don't understand. If you're saying that you are keeping a lot of multihashes alive, then this PR will worsen your memory usage, and cloning an inline multiaddr/multihash is presumably more expensive than cloning an Arc.

@rklaehn
Contributor Author

rklaehn commented Mar 30, 2020

But if multiaddr is supposed to be a general-purpose crate and not just a support library for Parity, then these situations can happen. E.g. some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.

If anything, this change ties the multiaddr more to a specific usage.
If you are storing a lot of Multiaddrs that are larger than 30 bytes, then this PR is strictly negative, as every single instance of Multiaddr is now at least 30 bytes in size.
Anything other than Multiaddr { _: Vec<u8> } is opinionated about how it is being used.

If you have large multiaddrs and do not benefit from the inline case, a Multiaddr that is essentially an Arc<[u8]> is still more memory-efficient than a multiaddr that is a Vec<u8>. The heap-allocated byte buffer is often larger than the storage that is actually required. E.g. if you have a multiaddr that is 34 bytes, the buffer backing the Vec will probably be 64 bytes, because the vector's capacity doubles as it grows. If your multiaddr is the result of some pushes and pops, the buffer backing the Vec might be even larger.
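A quick illustration of the over-allocation point; the exact capacity depends on the standard library's growth strategy, but doubling is what current std does.

fn main() {
    // Grow a 34-byte buffer one byte at a time, the way repeated pushes of
    // protocols would grow the Vec behind a multiaddr.
    let mut v: Vec<u8> = Vec::new();
    for i in 0..34u8 {
        v.push(i);
    }
    // len is 34, but the backing allocation has grown in doubling steps,
    // so the capacity is typically 64.
    println!("len = {}, capacity = {}", v.len(), v.capacity());
}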

some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.
for which it definitely matters because an IPFS node will keep a very large number of multihashes around

I don't understand. If you're saying that you are keeping a lot of multihashes alive, then this PR will worsen your memory usage,

If you have a lot of distinct multiaddrs that fit the inline case, this will improve memory usage since you don't touch the heap at all. If you have a lot of distinct multiaddrs that do not fit the inline case, it is still an improvement over an Arc<Vec<u8>>, which is the current state, because there is one object on the heap instead of two.

and cloning an inline multiaddr/multihash is presumably more expensive than cloning an Arc.

Unless you go to very large inline multiaddrs, it is significantly cheaper than cloning an Arc. See the attached benchmark, which compares the inline case (copying 32 bytes) to the non-inline case (increasing the strong count of the Arc).

But this whole thing is not worth having a heated discussion about. I think this is a small but in some cases significant improvement. If you think this is a bad tradeoff and don't want to merge this, I am fine with that as well.

@rklaehn
Contributor Author

rklaehn commented Mar 30, 2020

Just out of curiosity, I ran a small example that just allocates a bunch of multiaddrs, on this branch and on master, for the inline and non-inline case, and ran heaptrack on it.

use multiaddr::Multiaddr;
fn main() {
    // 23-character dns4 name: the encoded multiaddr still fits inline on this branch
    let inline = "/dns4/01234567890123456789123/tcp/80/ws";
    // 25-character dns4 name: the encoding is just over the inline limit and spills to the heap
    let heap = "/dns4/0123456789012345678901234/tcp/80/ws";
    let mut t: Vec<Multiaddr> = Vec::new();
    for _ in 0..1000000 {
        // t.push(heap.parse().unwrap());
        t.push(inline.parse().unwrap());
    }
}

This branch / inline

peak heap memory consumption:
33,6 MB after 1.737s

This branch / heap

peak heap memory consumption:
81,6 MB after 3.423s

master / same size as inline case above

peak heap memory consumption:
98,5 MB after 3.469s

master / same size as heap case above

peak heap memory consumption:
102,5 MB after 3.352s

So in the inline case, this branch consumes quite a bit less memory than master. But also in the heap case, this branch consumes a bit less memory than master, presumably because Arc<[u8]> is just a single object on the heap, vs. 2 for Arc<Vec<u8>>.
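The "1 instead of 2 objects" point can be seen directly in how the two types are built; a sketch, with allocation counts as the standard library behaves today.

use std::sync::Arc;

fn main() {
    // Arc<Vec<u8>>: the Arc allocation holds the refcounts plus the Vec header,
    // and the Vec header points at a second allocation containing the bytes.
    let two_objects: Arc<Vec<u8>> = Arc::new(vec![0u8; 34]);

    // Arc<[u8]>: the refcounts and the bytes live in one allocation behind a
    // fat pointer, so there is a single heap object per multiaddr.
    let one_object: Arc<[u8]> = Arc::from(vec![0u8; 34]);

    assert_eq!(two_objects.len(), one_object.len());
}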

I am not sure if there is a better way to measure the actual heap consumption including overhead. I was looking for a version of https://docs.rs/heapsize/0.4.2/heapsize/ that works with current Rust, but could not find anything quickly.

@rklaehn
Contributor Author

rklaehn commented May 28, 2020

@tomaka I am going to close this. I think doing this in a way that is more agreeable to you is beyond my current Rust capabilities.

Ideally you would want to completely separate the notion of a multiaddr (which is just a refinement of [u8]) from the concern of storing values inline on the stack when they are small enough to avoid the heap, correct?

That would be a bit more work, which might not be worth it...
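For what it's worth, one hedged sketch of what such a separation could look like: keep Multiaddr as a thin wrapper over arbitrary byte storage and let the caller pick the representation. Names and bounds here are purely illustrative; this is not code from the PR.

use std::sync::Arc;

/// A multiaddr is just a validated byte sequence; how the bytes are stored is
/// a separate, pluggable concern.
struct Multiaddr<S = Arc<[u8]>> {
    bytes: S,
}

impl<S: AsRef<[u8]>> Multiaddr<S> {
    /// Read-only operations only ever need a byte slice.
    fn as_slice(&self) -> &[u8] {
        self.bytes.as_ref()
    }
}

// Callers that want the small-size optimization could choose an inline-capable
// storage, e.g. a SmallVec, while everyone else keeps the Arc<[u8]> default:
// type CompactMultiaddr = Multiaddr<smallvec::SmallVec<[u8; 32]>>;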

@rklaehn rklaehn closed this May 28, 2020