
Multiaddr: Replace Arc<Vec<u8>> with inline storage #1510


Closed
rklaehn wants to merge 7 commits into master from rkl/multiaddr-inline-storage

Conversation

rklaehn
Contributor

@rklaehn rklaehn commented Mar 24, 2020

Storage keeps small byte sequences inline and spills to the heap only once that is no longer possible.

A typical small multiaddr, e.g. an ip4 or ip6 host/port pair, will be stored inline. Only more complex multiaddrs, such as those containing multiple peer ids, will be stored on the heap as an Arc<[u8]>.

The push and with fns allocate temporary vectors, but I think this is an acceptable tradeoff for making the in-memory representation of multiaddrs much smaller. Many of the multiaddrs you hold on to are never built manually but come from somewhere else.
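For illustration, a minimal sketch of the storage scheme described above. The MAX_INLINE constant and the Storage::Inline variant do appear in the diff under review; the exact field layout and helper methods here are illustrative, not copied from the PR.

use std::sync::Arc;

/// Maximum number of bytes stored inline (the diff uses 30).
const MAX_INLINE: usize = 30;

#[derive(Clone)]
enum Storage {
    /// Small payloads live directly in the enum; `len` records how many of
    /// the `data` bytes are in use.
    Inline { len: u8, data: [u8; MAX_INLINE] },
    /// Larger payloads spill into a single reference-counted heap allocation.
    Heap(Arc<[u8]>),
}

impl Storage {
    fn from_slice(bytes: &[u8]) -> Self {
        if bytes.len() <= MAX_INLINE {
            let mut data = [0u8; MAX_INLINE];
            data[..bytes.len()].copy_from_slice(bytes);
            Storage::Inline { len: bytes.len() as u8, data }
        } else {
            Storage::Heap(bytes.into())
        }
    }

    fn as_slice(&self) -> &[u8] {
        match self {
            Storage::Inline { len, data } => &data[..*len as usize],
            Storage::Heap(arc) => &arc[..],
        }
    }
}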

@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 3dc7801 to c84042f Compare March 24, 2020 20:42
@rklaehn rklaehn changed the title Replace Arc<Vec<u8>> with inline storage Multiaddr: Replace Arc<Vec<u8>> with inline storage Mar 25, 2020
@rklaehn
Contributor Author

rklaehn commented Mar 25, 2020

Note: this is basically identical to the optimization that has been done in rust-multihash in multiformats/rust-multihash#47

Member

@mxinden mxinden left a comment

For what my review is worth, this looks good to me.

I would prefer every performance optimization to come with a benchmark that prevents future us from introducing performance regressions. Something along the lines of https://gist.github.com/vmx/0dc0d981b522e2834084559faa0dcb56. What do you all think?

@rklaehn
Contributor Author

rklaehn commented Mar 26, 2020

@mxinden I can add a clone benchmark that compares cloning multiaddrs slightly below the limit with multiaddrs slightly above the limit. But it will be hard to integrate this into CI, so it would just be something you run on demand when you want to touch this part of the code...

@rklaehn
Contributor Author

rklaehn commented Mar 26, 2020

Added a small bench and a test that shows where the limits of inline storage are.

Here are the results:

$ cargo bench --bench clone
clone 10 max inlined   time:   [45.106 ns 45.587 ns 46.119 ns]                                   
clone 10 min heap      time:   [135.29 ns 135.86 ns 136.50 ns]                               

Note that the main point of the optimization is to avoid touching the heap and to reduce total memory usage, but a factor-of-3 speedup in cloning is a nice benefit. Since cloning just increases the strong_count of the Arc, it is independent of size in the heap case.
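For reference, a sketch of how such a clone benchmark could look with criterion. The group names match the output above, the two addresses are the inline/heap examples used in the heaptrack experiment further down, and the "10" is assumed to mean ten clones per iteration; the actual bench in the PR may differ.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use multiaddr::Multiaddr;

fn clone_bench(c: &mut Criterion) {
    // Encoded size right at the inline limit.
    let inlined: Multiaddr = "/dns4/01234567890123456789123/tcp/80/ws".parse().unwrap();
    // A slightly longer name: the encoding no longer fits inline and spills to the heap.
    let heap: Multiaddr = "/dns4/0123456789012345678901234/tcp/80/ws".parse().unwrap();

    c.bench_function("clone 10 max inlined", |b| {
        b.iter(|| {
            for _ in 0..10 {
                black_box(inlined.clone());
            }
        })
    });
    c.bench_function("clone 10 min heap", |b| {
        b.iter(|| {
            for _ in 0..10 {
                black_box(heap.clone());
            }
        })
    });
}

criterion_group!(benches, clone_bench);
criterion_main!(benches);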

@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 73cad48 to cdbf554 Compare March 26, 2020 09:54
/// a multiaddr containing an ipv4 or ipv6 address and port.
///
/// More complex multiaddrs like those containing peer ids will be stored on the heap.
const MAX_INLINE: usize = 30;
Contributor

Why not 62? Most peer IDs would fit in there as well and peer IDs are quite a common case.

Contributor Author

@rklaehn rklaehn Mar 26, 2020

We could make it 62. I have no idea about the size distribution of the typical multiaddr. From what I have seen in rust-libp2p, and for what I intend to use this for, you often retain addresses for other nodes, which are just host/port/protocol. If you say that the ones containing a peer id are sufficiently common, I will change this to 62.

Contributor Author

Unless large multiaddrs are way more popular than I think, 64 would be too big. Falling back to the Arc'd version for large multiaddrs is not the end of the world, and there are situations where having lots of 64-byte objects would be worse than having a bunch of pointers to Arcs.
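To make the size tradeoff concrete, a small standalone check one could run; the printed values are what a typical 64-bit target gives today, not guarantees.

fn main() {
    // A Vec<u8> is pointer + length + capacity.
    println!("Vec<u8>:      {} bytes", std::mem::size_of::<Vec<u8>>());                 // typically 24
    // Arc<Vec<u8>> is a thin pointer to an Arc allocation that holds the Vec header.
    println!("Arc<Vec<u8>>: {} bytes", std::mem::size_of::<std::sync::Arc<Vec<u8>>>()); // typically 8
    // Arc<[u8]> is a fat pointer (pointer + length).
    println!("Arc<[u8]>:    {} bytes", std::mem::size_of::<std::sync::Arc<[u8]>>());    // typically 16
    // With MAX_INLINE = 62, every Multiaddr would carry a buffer of this size
    // (plus length and discriminant), i.e. roughly 64 bytes per instance.
    println!("[u8; 62]:     {} bytes", std::mem::size_of::<[u8; 62]>());                // 62
}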

Some(protocol)
}

/// Like [`Multiaddr::push`] but consumes `self`.
pub fn with(mut self, p: Protocol<'_>) -> Self {
let mut w = io::Cursor::<&mut Vec<u8>>::new(Arc::make_mut(&mut self.bytes));
let mut w = io::Cursor::new(self.to_vec());
Contributor

Can we not avoid this allocation if p fits into Storage::Inline?

Contributor Author

Same as above.

@twittner
Contributor

Use smallvec to avoid some allocations in push and with

Sorry, what I had in mind was to use SmallVec everywhere instead of Storage. From my perspective all that is needed compared to master is this change:

diff --git a/misc/multiaddr/Cargo.toml b/misc/multiaddr/Cargo.toml
index c30b4504..a6ce960d 100644
--- a/misc/multiaddr/Cargo.toml
+++ b/misc/multiaddr/Cargo.toml
@@ -16,6 +16,7 @@ data-encoding = "2.1"
 multihash = "0.10"
 percent-encoding = "2.1.0"
 serde = "1.0.70"
+smallvec = { version = "1.0", features = ["write"] }
 static_assertions = "1.1"
 unsigned-varint = "0.3"
 url = { version = "2.1.0", default-features = false }
diff --git a/misc/multiaddr/src/lib.rs b/misc/multiaddr/src/lib.rs
index f81b8ada..58906a48 100644
--- a/misc/multiaddr/src/lib.rs
+++ b/misc/multiaddr/src/lib.rs
@@ -14,10 +14,10 @@ use serde::{
     Serializer,
     de::{self, Error as DeserializerError}
 };
+use smallvec::SmallVec;
 use std::{
     convert::TryFrom,
     fmt,
-    io,
     iter::FromIterator,
     net::{IpAddr, Ipv4Addr, Ipv6Addr},
     result::Result as StdResult,
@@ -35,19 +35,22 @@ static_assertions::const_assert! {
     std::mem::size_of::<usize>() <= std::mem::size_of::<u64>()
 }
 
+/// Size to use with `SmallVec`.
+const INLINE_SIZE: usize = 32;
+
 /// Representation of a Multiaddr.
 #[derive(PartialEq, Eq, Clone, Hash)]
-pub struct Multiaddr { bytes: Arc<Vec<u8>> }
+pub struct Multiaddr { bytes: Arc<SmallVec<[u8; INLINE_SIZE]>> }
 
 impl Multiaddr {
     /// Create a new, empty multiaddress.
     pub fn empty() -> Self {
-        Self { bytes: Arc::new(Vec::new()) }
+        Self { bytes: Arc::new(SmallVec::new()) }
     }
 
     /// Create a new, empty multiaddress with the given capacity.
     pub fn with_capacity(n: usize) -> Self {
-        Self { bytes: Arc::new(Vec::with_capacity(n)) }
+        Self { bytes: Arc::new(SmallVec::with_capacity(n)) }
     }
 
     /// Return the length in bytes of this multiaddress.
@@ -73,9 +76,8 @@ impl Multiaddr {
     /// ```
     ///
     pub fn push(&mut self, p: Protocol<'_>) {
-        let mut w = io::Cursor::<&mut Vec<u8>>::new(Arc::make_mut(&mut self.bytes));
-        w.set_position(w.get_ref().len() as u64);
-        p.write_bytes(&mut w).expect("Writing to a `io::Cursor<&mut Vec<u8>>` never fails.")
+        let mut w = Arc::make_mut(&mut self.bytes);
+        p.write_bytes(&mut w).expect("Writing to a `SmallVec` never fails.")
     }
 
     /// Pops the last `Protocol` of this multiaddr, or `None` if the multiaddr is empty.
@@ -107,9 +109,8 @@ impl Multiaddr {
 
     /// Like [`Multiaddr::push`] but consumes `self`.
     pub fn with(mut self, p: Protocol<'_>) -> Self {
-        let mut w = io::Cursor::<&mut Vec<u8>>::new(Arc::make_mut(&mut self.bytes));
-        w.set_position(w.get_ref().len() as u64);
-        p.write_bytes(&mut w).expect("Writing to a `io::Cursor<&mut Vec<u8>>` never fails.");
+        let mut w = Arc::make_mut(&mut self.bytes);
+        p.write_bytes(&mut w).expect("Writing to a `SmallVec` never fails.");
         self
     }
 
@@ -212,7 +213,7 @@ impl<'a> FromIterator<Protocol<'a>> for Multiaddr {
     where
         T: IntoIterator<Item = Protocol<'a>>,
     {
-        let mut writer = Vec::new();
+        let mut writer = SmallVec::new();
         for cmp in iter {
             cmp.write_bytes(&mut writer).expect("Writing to a `Vec` never fails.");
         }
@@ -224,7 +225,7 @@ impl FromStr for Multiaddr {
     type Err = Error;
 
     fn from_str(input: &str) -> Result<Self> {
-        let mut writer = Vec::new();
+        let mut writer = SmallVec::new();
         let mut parts = input.split('/').peekable();
 
         if Some("") != parts.next() {
@@ -262,7 +263,7 @@ impl<'a> Iterator for Iter<'a> {
 
 impl<'a> From<Protocol<'a>> for Multiaddr {
     fn from(p: Protocol<'a>) -> Multiaddr {
-        let mut w = Vec::new();
+        let mut w = SmallVec::new();
         p.write_bytes(&mut w).expect("Writing to a `Vec` never fails.");
         Multiaddr { bytes: Arc::new(w) }
     }
@@ -299,7 +300,7 @@ impl TryFrom<Vec<u8>> for Multiaddr {
             let (_, s) = Protocol::from_bytes(slice)?;
             slice = s
         }
-        Ok(Multiaddr { bytes: Arc::new(v) })
+        Ok(Multiaddr { bytes: Arc::new(SmallVec::from(v)) })
     }
 }

@twittner
Contributor

Ah sorry, that makes no sense. We still do the Arc.
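A short illustration of the point being conceded here, assuming smallvec 1.x: the inline buffer only helps while the SmallVec itself lives on the stack; putting it behind an Arc moves the whole thing to the heap again.

use smallvec::SmallVec;
use std::sync::Arc;

fn main() {
    // The inline buffer avoids a heap allocation on its own...
    let small: SmallVec<[u8; 32]> = SmallVec::from_slice(&[1, 2, 3]);
    assert!(!small.spilled()); // still inline, nothing on the heap yet

    // ...but Arc::new heap-allocates the whole SmallVec, inline buffer included,
    // so the allocation is saved neither for small nor for large multiaddrs.
    let _arced: Arc<SmallVec<[u8; 32]>> = Arc::new(small);
}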

}

#[test]
fn roundtrip() {
Contributor

Is this test not redundant given the property test below?

Contributor Author

Well, this one is guaranteed to hit all the interesting cases, whereas with the property-based test you depend on luck. But this was copied from multihash, which is an independent repository. So if you want to reduce the number of tests in rust-libp2p, which is a bigger project, I can remove it.

Contributor Author

If you are concerned about test runtime, it would probably be best to keep this one and drop the property-based test.
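For reference, a sketch of what such an exhaustive roundtrip test could look like, walking the encoded size across the inline limit; the addresses and test name are illustrative, not the actual test from the PR.

use multiaddr::Multiaddr;
use std::convert::TryFrom;

#[test]
fn roundtrip_across_inline_limit() {
    // dns4 names of increasing length move the encoded size from well below
    // MAX_INLINE to well above it, so both storage variants are exercised.
    for n in 1..64 {
        let text = format!("/dns4/{}/tcp/80", "x".repeat(n));
        let addr: Multiaddr = text.parse().unwrap();
        // bytes -> Multiaddr must reproduce the same value...
        let back = Multiaddr::try_from(addr.to_vec()).unwrap();
        assert_eq!(addr, back);
        // ...and so must the textual form.
        assert_eq!(addr.to_string(), text);
    }
}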

@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 48def4c to 7e96076 Compare March 26, 2020 17:15
@rklaehn rklaehn force-pushed the rkl/multiaddr-inline-storage branch from 7e96076 to 0c0485f Compare March 26, 2020 17:44
@tomaka
Member

tomaka commented Mar 30, 2020

This comment is kind of irrelevant here since we're also using Bytes, but I'm really not a fan of this kind of magic and tend to prefer explicitness.

To me, instead of optimizing the insides of the Multiaddr, we should either avoid cloning multiaddresses or use an Arc<Multiaddr>.

I also strongly think that we're micro-optimizing here. Benchmarking how much time it takes to clone a Multiaddr is based on the assumption that we're spending a lot of time cloning multiaddresses, while my guess is that this accounts for not even 0.1% of the CPU time of a typical libp2p application.

That being said, I don't care that much and I'm not opposed to merging this.

@rklaehn
Contributor Author

rklaehn commented Mar 30, 2020

Just to be clear, the purpose of this change is not to make clone faster. That is just a welcome but not very important side effect. I am perfectly happy with the speed of cloning an Arc.

The purpose of the optimization is to reduce the memory usage of a multiaddr, to make memory usage more deterministic, and to avoid touching the heap at all in the average case. (The worst-case memory usage of an Arc<Vec<u8>> can be arbitrarily large, and even in the average case it will be significantly more than just the raw bytes, because it is two objects on the heap.)

I agree that this is an optimization that does not matter at all unless you are running on very low-powered hardware and/or have a very large number of multiaddrs sticking around. But if multiaddr is supposed to be a general-purpose crate and not just a support library for Parity, then these situations can happen. E.g. some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.

Besides, as far as micro-optimizations go, I think this one is pretty harmless and self-contained. It does not add unsafe code or additional dependencies (except for smallvec, which I could also remove again). The optimization was originally done for multihash, for which it definitely matters because an IPFS node will keep a very large number of multihashes around. I thought it would be nice to carry it over to multiaddr, where admittedly it matters a bit less.

@tomaka
Member

tomaka commented Mar 30, 2020

But if multiaddr is supposed to be a general-purpose crate and not just a support library for Parity, then these situations can happen. E.g. some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.

If anything, this change ties the multiaddr more to a specific usage.
If you are storing a lot of Multiaddrs that are larger than 30 bytes, then this PR is strictly negative, as every single instance of Multiaddr is now at least 30 bytes in size.
Anything other than Multiaddr { _: Vec<u8> } is opinionated about how it is being used.

some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.
for which it definitely matters because an IPFS node will keep a very large number of multihashes around

I don't understand. If you're saying that you are keeping a lot of multihashes alive, then this PR will worsen your memory usage, and cloning an inline multiaddr/multihash is presumably more expensive than cloning an Arc.

@rklaehn
Contributor Author

rklaehn commented Mar 30, 2020

But if multiaddr is supposed to be a general-purpose crate and not just a support library for Parity, then these situations can happen. E.g. some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.

If anything, this change ties the multiaddr more to a specific usage.
If you are storing a lot of Multiaddrs that are larger than 30 bytes, then this PR is strictly negative, as every single instance of Multiaddr is now at least 30 bytes in size.
Anything other than Multiaddr { _: Vec<u8> } is opinionated about how it is being used.

If you have large multiaddrs and do not benefit from the inline case, a Multiaddr that is essentially an Arc<[u8]> is still more memory-efficient than a multiaddr that is a Vec<u8>. The heap-allocated byte buffer is often larger than the storage that is actually required. E.g. if you have a multiaddr that is 34 bytes, the buffer backing the Vec will probably be 64 bytes, because the vector's capacity doubles as it grows. If your multiaddr is the result of some pushes and pops, the buffer backing the Vec might be even larger.
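A quick illustration of the over-allocation point; the exact capacity depends on the standard library's growth strategy, but doubling is what current std does.

fn main() {
    // Grow a 34-byte buffer one byte at a time, the way repeated pushes of
    // protocols would grow the Vec behind a multiaddr.
    let mut v: Vec<u8> = Vec::new();
    for i in 0..34u8 {
        v.push(i);
    }
    // len is 34, but the backing allocation has grown in doubling steps,
    // so the capacity is typically 64.
    println!("len = {}, capacity = {}", v.len(), v.capacity());
}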

some of the IPFS nodes used by pinning services like Pinata keep quite a large number of connections open to work around deficiencies of the current content routing.
for which it definitely matters because an IPFS node will keep a very large number of multihashes around

I don't understand. If you're saying that you are keeping a lot of multihashes alive, then this PR will worsen your memory usage,

If you have a lot of distinct multiaddrs that fit the inline case, this will improve memory usage since you don't touch the heap at all. If you have a lot of distinct multiaddrs that do not fit the inline case, it is still an improvement over an Arc<Vec<u8>>, which is the current state, because there is one object on the heap instead of two.

and cloning an inline multiaddr/multihash is presumably more expensive than cloning an Arc.

Unless you go to very large inline multiaddrs, it is significantly cheaper than cloning an Arc. See the attached benchmark, which compares the inline case (copying 32 bytes) to the non-inline case (increasing the strong count of the Arc).

But this whole thing is not worth having a heated discussion about. I think this is a small but in some cases significant improvement. If you think this is a bad tradeoff and don't want to merge this, I am fine with that as well.

@rklaehn
Contributor Author

rklaehn commented Mar 30, 2020

Just out of curiosity, I ran a small example that just allocates a bunch of multiaddrs, on this branch and on master, for the inline and non-inline case, and ran heaptrack on it.

use multiaddr::Multiaddr;
fn main() {
    // 23-character dns4 name: the encoded multiaddr still fits inline on this branch
    let inline = "/dns4/01234567890123456789123/tcp/80/ws";
    // 25-character dns4 name: the encoding is just over the inline limit and spills to the heap
    let heap = "/dns4/0123456789012345678901234/tcp/80/ws";
    let mut t: Vec<Multiaddr> = Vec::new();
    for _ in 0..1000000 {
        // t.push(heap.parse().unwrap());
        t.push(inline.parse().unwrap());
    }
}

This branch / inline

peak heap memory consumption:
33,6 MB after 1.737s

This branch / heap

peak heap memory consumption:
81,6 MB after 3.423s

master / same size as inline case above

peak heap memory consumption:
98,5 MB after 3.469s

master / same size as heap case above

peak heap memory consumption:
102,5 MB after 3.352s

So in the inline case, this branch consumes quite a bit less memory than master. But also in the heap case, this branch consumes a bit less memory than master, presumably because Arc<[u8]> is just a single object on the heap, vs. 2 for Arc<Vec<u8>>.
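The "1 instead of 2 objects" point can be seen directly in how the two types are built; a sketch, with allocation counts as the standard library behaves today.

use std::sync::Arc;

fn main() {
    // Arc<Vec<u8>>: the Arc allocation holds the refcounts plus the Vec header,
    // and the Vec header points at a second allocation containing the bytes.
    let two_objects: Arc<Vec<u8>> = Arc::new(vec![0u8; 34]);

    // Arc<[u8]>: the refcounts and the bytes live in one allocation behind a
    // fat pointer, so there is a single heap object per multiaddr.
    let one_object: Arc<[u8]> = Arc::from(vec![0u8; 34]);

    assert_eq!(two_objects.len(), one_object.len());
}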

I am not sure if there is a better way to measure the actual heap consumption including overhead. I was looking for a version of https://docs.rs/heapsize/0.4.2/heapsize/ that works with current Rust, but could not find anything quickly.

@rklaehn
Contributor Author

rklaehn commented May 28, 2020

@tomaka I am going to close this. I think doing this in a way that is more agreeable to you is beyond my current Rust capabilities.

Ideally you would want to completely separate the notion of a multiaddr (which is just a refinement of [u8]) from the concern of storing values inline on the stack when they are small enough to avoid the heap, correct?

That would be a bit more work, which might not be worth it...
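For what it's worth, one hedged sketch of what such a separation could look like: keep Multiaddr as a thin wrapper over arbitrary byte storage and let the caller pick the representation. Names and bounds here are purely illustrative; this is not code from the PR.

use std::sync::Arc;

/// A multiaddr is just a validated byte sequence; how the bytes are stored is
/// a separate, pluggable concern.
struct Multiaddr<S = Arc<[u8]>> {
    bytes: S,
}

impl<S: AsRef<[u8]>> Multiaddr<S> {
    /// Read-only operations only ever need a byte slice.
    fn as_slice(&self) -> &[u8] {
        self.bytes.as_ref()
    }
}

// Callers that want the small-size optimization could choose an inline-capable
// storage, e.g. a SmallVec, while everyone else keeps the Arc<[u8]> default:
// type CompactMultiaddr = Multiaddr<smallvec::SmallVec<[u8; 32]>>;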

@rklaehn rklaehn closed this May 28, 2020