Merge feature/perf to master #6229
Merged
No functional change. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
No functional change. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…queue Wake up the scheduler immediately when there are more tasks. Otherwise timeouts <10s may not work correctly, and it is difficult to test the periodic scheduler if you need to wait 10s for it to start working. If there are no tasks, then it will still sleep efficiently, but as soon as more tasks are added (with [~signal:true], which is the default) it will immediately wake up and calculate the next sleep time. In practice it is probably quite rare for XAPI's queue to be empty (there are usually periodic tasks), but we cannot rely on this. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
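As an illustration of this wake-up behaviour (a minimal sketch using only the stdlib Mutex/Condition, not the actual xapi-stdext scheduler code, which has its own priority queue and a timed wait), adding a task signals the sleeping scheduler so it can recompute its next deadline instead of finishing a stale 10s sleep:

```ocaml
(* Illustrative only: names and the queue representation are assumptions. *)
let m = Mutex.create ()
let cond = Condition.create ()
let queue : (string * float) Queue.t = Queue.create () (* (name, run_at) *)

(* Adding a task signals the scheduler by default, like [~signal:true]. *)
let add_to_queue ?(signal = true) name run_at =
  Mutex.lock m ;
  Queue.push (name, run_at) queue ;
  if signal then Condition.signal cond ;
  Mutex.unlock m

(* Scheduler loop: sleeps efficiently while the queue is empty, wakes up as
   soon as a task is added, then computes the next sleep time. *)
let rec loop () =
  Mutex.lock m ;
  while Queue.is_empty queue do
    Condition.wait cond m
  done ;
  let name, _run_at = Queue.pop queue in
  Mutex.unlock m ;
  ignore name (* ... wait until run_at, run the task, repeat ... *) ;
  loop ()
```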
…internal communication Feature flag: use-xmlrpc Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…owercase Tasks are lowercase Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Subscription.object_matches already does it Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Creates a new library, `Tgroup`, that abstracts and manages groups of execution threads in xapi. When xapi is under load, all the threads need to share a single CPU in dom0 because of the OCaml runtime's single-CPU restriction. This library is meant to orchestrate the threads into different priority groups. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
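As a rough sketch of what such an abstraction could look like (the real `Tgroup` module's types and function names may differ; this is not the actual library code), each thread records which group it belongs to, keyed on its thread id:

```ocaml
(* Hedged sketch; [group] values and the storage scheme are assumptions. *)
module Tgroup_sketch = struct
  type group = Internal | External

  let classification : (int, group) Hashtbl.t = Hashtbl.create 97
  let m = Mutex.create ()

  (* Record the group of the calling thread. *)
  let set_group g =
    let id = Thread.id (Thread.self ()) in
    Mutex.lock m ;
    Hashtbl.replace classification id g ;
    Mutex.unlock m

  (* Look up the group of the calling thread, if it has been classified. *)
  let get_group () =
    let id = Thread.id (Thread.self ()) in
    Mutex.lock m ;
    let g = Hashtbl.find_opt classification id in
    Mutex.unlock m ;
    g
end
```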
`set_cgroup` adds the ability to place the current thread into a cgroup based on its creator. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Adds the ability to set the tgroup based on an HTTP header named `originator`. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Clears the `extra_headers` of a `UDSTransport` instance when making a connection. Previously this was done only when tracing was enabled, inside the `with_tracecontext` method, to avoid header duplication when `make_connection` was used multiple times. Currently there is no other use of `add_extra_headers` or other update to `_extra_headers`, so it is safe to clear it whenever we make a new connection. (`xmlrpclib.Transport` updates the `_extra_headers` attribute only inside its `make_connection` method, but we override that method with our own for `UDSTransport`.) Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
XenAPI.py now passes an additional originator header when making requests to xapi, if the "ORIGINATOR" env var is present. Sm_exec now passes an env var, "ORIGINATOR=SM". To classify the threads correctly, we first need to determine the requests' originators. This commit makes it possible to state the originators explicitly as headers. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
For now, the thread executing `Xapi.server_init` and its children are classified as external. The only exception is HTTP requests that come through the SMAPI internally. If those contain the originator header with the value set to "sm", the thread executing the request will be classified as internal. This represents the first phase of classifying xapi threads as internal vs external. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Adds a configurable variable in `xapi_globs`, `tgroups_enabled`, that acts as a guard for tgroup classification of the threads. If the guard is `false`, all Tgroup functionality acts as a no-op. For instance, adding the line `tgroups-enabled = false` results in the thread classification being skipped. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
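A minimal sketch of such a guard, assuming a `tgroups_enabled` ref mirroring the xapi_globs flag and a hypothetical `classify_thread` stand-in for the real classification call:

```ocaml
(* Illustrative only; names are assumptions, not the actual xapi code. *)
let tgroups_enabled = ref false

let classify_thread creator =
  if not !tgroups_enabled then
    () (* guard off: classification is a no-op *)
  else
    (* ... look up or create the tgroup for [creator] and join it ... *)
    ignore creator
```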
…n flag This is more efficient: we can watch a single task, instead of everything in the DB. Feature-flag: use-event-next No functional change. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
int32_of_rpc doesn't accept Int32 as input, just Int, because none of the current deserializers actually produce an Int32 (Int32 is only used by serializers to emit something different). This is an upstream ocaml-rpc bug that should be fixed; meanwhile, convert Rpc.Int32 to Rpc.Int, so that the 'fake_rpc' inside XAPI can use Event.from. Otherwise you get this error: ``` Expected int32, got 'I32(0) ``` Signed-off-by: Edwin Török <edwin.torok@cloud.com>
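A sketch of such a workaround (not necessarily the exact fix in this PR) is to rewrite any `Rpc.Int32` nodes into `Rpc.Int` before handing the value to the deserializer:

```ocaml
(* Recursively replace Rpc.Int32 with Rpc.Int so int32_of_rpc accepts it. *)
let rec fixup_int32 : Rpc.t -> Rpc.t = function
  | Rpc.Int32 i ->
      Rpc.Int (Int64.of_int32 i)
  | Rpc.Enum xs ->
      Rpc.Enum (List.map fixup_int32 xs)
  | Rpc.Dict kvs ->
      Rpc.Dict (List.map (fun (k, v) -> (k, fixup_int32 v)) kvs)
  | x ->
      x
```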
…vent.next This is more efficient: we can watch a single task, instead of everything in the DB. Feature-flag: use-event-next Signed-off-by: Edwin Török <edwin.torok@cloud.com>
… of Event.next Feature flag: use-event-next Signed-off-by: Edwin Török <edwin.torok@cloud.com>
When XAPI's queue is empty the periodic scheduler would sleep for 10s, during which time newly added jobs would get ignored. This is mostly noticeable during unit tests, but we cannot rely on the queue not being empty in production either. It is better to wake up the scheduler on the soonest of these two events: * after 10s * when a new task gets added to the queue
Previously it'd only run when we added or removed entries, but on an idle system we'd keep a large number of connections open that we don't need, and this then exceeded the connection limits on the coordinator. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Introduce separate coordinator_max_stunnel_cache and member_max_stunnel_cache settings, and set these on XAPI startup based on host role. No functional change, the defaults for both match the previous hardcoded value (70). Signed-off-by: Edwin Török <edwin.torok@cloud.com>
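A minimal sketch of picking the limit from the host role at startup (the setting names come from the description above, but the selection code here is illustrative, not the actual xapi startup code):

```ocaml
(* Defaults match the previous hardcoded value. *)
let coordinator_max_stunnel_cache = ref 70
let member_max_stunnel_cache = ref 70

(* Choose the effective cache limit based on the host's role. *)
let max_stunnel_cache ~is_coordinator =
  if is_coordinator then !coordinator_max_stunnel_cache
  else !member_max_stunnel_cache
```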
To avoid labeled argument with ' in name Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This parameter was passed through unchanged. However VDI.epoch_begin/epoch_end doesn't actually need it, so drop it. We intend to change how we compute the 'domain' parameter, and VDI.epoch_begin/end wouldn't have sufficient information to compute it anymore. No functional change. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We are going to change how we compute it, so factor it out into a function. No functional change. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
No feature flag, because this is a deprecated API. Clients that want the best performance should have been using Event.from already. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This delay was right after we waited for a new event, delaying all event responses by 50ms (including task completions). Eliminate the first delay, so that if we find the event we're looking for after the DB update, we can return immediately. On spurious wakeups (e.g. not the event we subscribed for) the delay is still useful, so keep it for recursive calls after the first one, and exponentially increase it up to the configured maximum. No feature flag, this is a relatively small change, and we use exponential backoffs elsewhere in XAPI already. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
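A sketch of the backoff shape described above (the constants and names are illustrative; the real delays are configurable in xapi):

```ocaml
(* Stands in for the configured maximum delay. *)
let max_delay = 0.25 (* seconds *)

(* [check] looks for the subscribed event; the first call passes ~delay:0.
   so a match right after the DB update returns immediately; spurious
   wakeups sleep with an exponentially increasing, capped delay. *)
let rec wait_for_event ~delay check =
  match check () with
  | Some result ->
      result
  | None ->
      if delay > 0. then Thread.delay delay ;
      let next = if delay = 0. then 0.05 else min max_delay (delay *. 2.) in
      wait_for_event ~delay:next check
```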
Give an opportunity for more fields to be filled, e.g. when waiting for a task to complete, give a chance for the task to actually run. No feature flag, it only changes timing. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We generate O(N^2) events when we update O(N) fields: each field update generates an event including the entire object, even if later we are going to change other fields of the same object. Instead of returning the individual field update events immediately (and generating a storm of events whenever an API client watches for VM power events), we batch these event updates by introducing a minimum amount of time that successive Event.from calls need to have between them. (The client is working as expected here: when it gets an event and processes it, it immediately calls Event.from to get more events.) Although this isn't guaranteed to eliminate the O(N^2) problem, in practice it reduces the overhead significantly. There is one case where we do want almost immediate notification of updates: task completions (because then the client likely wants to send us more tasks). This PR makes the already existing rate limiting in Xapi_event consistent and configurable, but doesn't yet introduce a batching delay for Event.from (it does for Event.next, which is deprecated). A separate PR (or config change) can then enable this for testing purposes, but also allows us to roll the change back by changing the tunable in the config file. There is also a new microbenchmark introduced here; I'll need to update that with the latest results.
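A minimal sketch of this kind of rate limiting (illustrative only: the interval value and names are not the real Xapi_event tunables), enforcing a minimum gap between successive replies so that several field writes on one object collapse into fewer event deliveries:

```ocaml
(* Minimum gap between successive event replies; illustrative value. *)
let min_interval = 0.05 (* seconds *)
let last_reply = ref neg_infinity

(* Delay the reply if the previous one was sent too recently, then
   compute and return the batched events. *)
let rate_limited_reply compute_events =
  let now = Unix.gettimeofday () in
  let wait = !last_reply +. min_interval -. now in
  if wait > 0. then Thread.delay wait ;
  last_reply := Unix.gettimeofday () ;
  compute_events ()
```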
Uses Gc.Memprof to run a hook function periodically. This checks whether we are holding any locks, and if not and sufficient time has elapsed since the last yield, then we yield. POSIX timers wouldn't work here, because they deliver signals, and there are too many places in XAPI that don't handle EINTR properly. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
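A hedged sketch of this mechanism against the OCaml 4.x Gc.Memprof API (requires the threads and mtime libraries; `holding_locks` is a hypothetical predicate standing in for xapi's lock tracking, and the sampling rate is illustrative):

```ocaml
(* Placeholder for the real "are we holding a contended lock?" check. *)
let holding_locks () = false

(* Sample allocations rarely; from the allocation callback, yield if enough
   wall-clock time has passed since the last yield and no lock is held. *)
let install_periodic_yield ~interval =
  let last = ref (Mtime_clock.counter ()) in
  let hook (_ : Gc.Memprof.allocation) =
    if (not (holding_locks ()))
       && Mtime.Span.compare (Mtime_clock.count !last) interval >= 0
    then (
      last := Mtime_clock.counter () ;
      Thread.yield () (* no-op if no other runnable OCaml thread *)
    ) ;
    None
  in
  Gc.Memprof.start ~sampling_rate:1e-4
    {Gc.Memprof.null_tracker with alloc_minor= hook; alloc_major= hook}
```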
And apply on startup. Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Adds more granularity to the xapi thread classification. Now an individual thread can be mapped to one of the following:
- {xapi_cgroup}/internal/SM;
- {xapi_cgroup}/internal/cli;
- {xapi_cgroup}/external/intrapool;
- {xapi_cgroup}/external/unauthenticated;
- {xapi_cgroup}/external/authenticated/{identity};
where {identity} is {auth_user_sid}/{user_agent}. Both the {auth_user_sid} and {user_agent} strings are sanitized when the identity is built through `Identity.make`, allowing only alphanumeric characters and a maximum length of 16 characters each. This is only the library change; it should allow for proper thread classification in xapi. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
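As an illustration of the sanitization rule described above (alphanumeric only, at most 16 characters per component), a minimal sketch could look like the following; `sanitize` and `make_identity` are hypothetical names, not necessarily the real `Identity.make` implementation:

```ocaml
(* Keep only alphanumeric characters and truncate to 16 characters. *)
let sanitize s =
  let b = Buffer.create 16 in
  String.iter
    (fun c ->
      if Buffer.length b < 16 then (
        match c with
        | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' -> Buffer.add_char b c
        | _ -> ()
      )
    )
    s ;
  Buffer.contents b

(* An identity is "{auth_user_sid}/{user_agent}" built from sanitized parts. *)
let make_identity ~auth_user_sid ~user_agent =
  Printf.sprintf "%s/%s" (sanitize auth_user_sid) (sanitize user_agent)
```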
Adds unit tests for:
- the correct thread classification of `of_creator`;
- the sanitization performed by `Identity.make`.
Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Classifies the threads at the time of session creation and inside `do_dispatch`. This ensures that new threads created by the current session/request inherit the proper classification. Note: threads created by xenopsd calling back into xapi are yet to be classified. Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Adds more granularity to the xapi thread classification. Now an individual thread can be mapped to one of the following:
- {xapi_cgroup}/internal/SM;
- {xapi_cgroup}/internal/cli;
- {xapi_cgroup}/external/intrapool;
- {xapi_cgroup}/external/unauthenticated;
- {xapi_cgroup}/external/authenticated/{identity};
where {identity} is {auth_user_sid}/{user_agent}. Both the {auth_user_sid} and {user_agent} strings are sanitized when the identity is built through `Identity.make`, allowing only alphanumeric characters and a maximum length of 16 characters each. This should allow for proper thread classification in xapi. Note: threads calling back into xapi from xenopsd are yet to be classified. Example cgroup hierarchy based on the new classification: (screenshot omitted). BVT: 208947 & BST: 208972
Fixed conflict in scheduler.ml (code that got deleted in feature/perf got modified in master)
Continuing from #6208 The conflict resolution can be reviewed with: `git log --remerge-diff -1 8b8af63` ```diff diff --git a/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml b/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml remerge CONFLICT (content): Merge conflict in ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml index 7bace30dae..27cf306 100644 --- a/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml +++ b/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml @@ -33,49 +33,8 @@ let (queue : t Ipq.t) = Ipq.create 50 queue_default let lock = Mutex.create () -<<<<<<< 6589d9a (Xapi thread classification - part 2 (#6154)) let add_to_queue_span name ty start_span newfunc = let ( ++ ) = Mtime.Span.add in -||||||| 4f3f08f -module Clock = struct - let span s = Mtime.Span.of_uint64_ns (Int64.of_float (s *. 1e9)) - - let span_to_s span = - Mtime.Span.to_uint64_ns span |> Int64.to_float |> fun ns -> ns /. 1e9 - - let add_span clock secs = - (* return mix or max available value if the add overflows *) - match Mtime.add_span clock (span secs) with - | Some t -> - t - | None when secs > 0. -> - Mtime.max_stamp - | None -> - Mtime.min_stamp -end - -let add_to_queue name ty start newfunc = - let ( ++ ) = Clock.add_span in -======= -module Clock = struct - let span s = Mtime.Span.of_uint64_ns (Int64.of_float (s *. 1e9)) - - let span_to_s span = Mtime.Span.to_float_ns span |> fun ns -> ns /. 1e9 - - let add_span clock secs = - (* return mix or max available value if the add overflows *) - match Mtime.add_span clock (span secs) with - | Some t -> - t - | None when secs > 0. -> - Mtime.max_stamp - | None -> - Mtime.min_stamp -end - -let add_to_queue name ty start newfunc = - let ( ++ ) = Clock.add_span in ->>>>>>> 8c9b754 (SDK fixes including CP-53003 (#6210)) let item = { Ipq.ev= {func= newfunc; ty; name} ``` This code got deleted in `feature/perf` 6b02474, where `span_to_s` got replaced with `Clock.Timer.span_to_s`, and changed in master by e68cda7. Both achieve the same result: `span_to_s` uses `Mtime.Span.to_float_ns`, except the commit on feature/perf also removes code duplication by reusing the other clock module.
# Changing the default OCaml thread switch timeslice from 50ms

The default OCaml 4.x timeslice for switching between threads is 50ms: if there is more than one active OCaml thread, each one is allowed to run for up to 50ms, and then (at various safepoints) the runtime can switch to another running thread. When the runtime lock is released (and C code or syscalls run), another OCaml thread is immediately allowed to run, if any. However, 50ms is too long, and it inserts large latencies into the handling of API calls. On the other hand, if a timeslice is too short then we waste CPU time:
* overhead of the Thread.yield system call, and the cost of switching threads at the OS level
* potentially higher L1/L2 cache misses if we switch between multiple OCaml threads on the same CPU
* potentially losing branch predictor history
* potentially higher L3 cache misses (but on a hypervisor with VMs running, L3 will be mostly taken up by VMs anyway; we can only rely on L1/L2 staying with us)

A microbenchmark has shown that timeslices as small as 0.5ms might strike an optimal balance between latency and overhead: values lower than that lose performance due to increased overhead, and values higher than that lose performance due to increased latency. (Plots omitted; the microbenchmark measures the number of CPU cycles spent simulating an API call with various working set sizes and timeslice settings.)

This is all hardware dependent though, and a future PR will introduce an autotune service that measures the yield overhead and L1/L2 cache refill overhead and calculates an optimal timeslice for that particular hardware/Xen/kernel combination. (And while we're at it, we can also tweak the minor heap size to match roughly half of the CPU's L2 cache.)

# Timeslice change mechanism

Initially I used `Unix.set_itimer` with virtual timers, to switch a thread only when it has been actively using the CPU for too long. However, that relies on delivering a signal to the process, and XAPI is very bad at handling signals. In fact XAPI is not allowed to receive any signals, because it doesn't handle EINTR well (a typical problem that sometimes affects C programs too). Although this is a well understood problem (described in the [OCaml Unix book](https://ocaml.github.io/ocamlunix/ocamlunix.html#sec88)), and some areas of XAPI make an effort to handle it, others just assert that they never receive one. Fixing that would require changes in all of XAPI (and its dependencies).

So instead I don't use signals at all, but rely on Statmemprof to trigger a hook that is executed "periodically", based not purely on time but on allocation activity (i.e. at places where the GC could run). The hook checks the elapsed time since the last time it got called, and if too much has passed it calls Thread.yield. Yield is smart enough to be a no-op if there aren't any other runnable OCaml threads. Yield isn't always beneficial at reducing latencies, though: if we are holding locks then we're just increasing latency for everyone who waits for that lock. So a mechanism is introduced to notify the periodic function when any highly contended lock (e.g. the XAPI DB lock) is held, and the yield is skipped in that case; a small sketch of this lock-tracking idea follows the results below.

# Plotting code

This PR only includes a very simplified version of the microbenchmark; a separate one will introduce the full cache plotting code (which is useful for development/troubleshooting purposes but won't be needed at runtime).

# Default timeslice value

Set to 5ms for now, just a bit above 4ms = 1/HZ in our Dom0 kernel; the autotuner from a future PR can change this to a more appropriate value.
(The autotuner needs more validation on a wider range of hardware.)

# Results

The cache measurements need to be repeated on a wider variety of hardware, but the timeslice changes here have already proven useful in reducing XAPI DB lock hold times (together with other optimizations).
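The lock-tracking sketch mentioned above could look like this (illustrative only: the counter, the wrapper name, and the choice of an `Atomic` counter are assumptions, not the actual xapi code):

```ocaml
(* Number of contended locks currently held by any thread. *)
let locks_held = Atomic.make 0

(* Wrap acquisition of a highly contended lock (e.g. the DB lock) so the
   periodic hook knows when yielding would only increase latency. *)
let with_contended_lock m f =
  Mutex.lock m ;
  Atomic.incr locks_held ;
  Fun.protect
    ~finally:(fun () -> Atomic.decr locks_held ; Mutex.unlock m)
    f

(* The periodic hook only yields when [Atomic.get locks_held = 0]. *)
```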
…l used internally By default Event.next is still used internally, so although this API is deprecated do not yet enable the throttling by default. Fixes: 3e1d8a2 ("CP-51692: Event.next: use same batching as Event.from") Fixes: 2b4e0db ("CP-49158: [prep] Event.{from,next}: make delays configurable and prepare for task specific delays") It slows down all synchronous API calls that create tasks, like VM.start. Only enable the throttling when Event.next is not used internally (`use-event-next = false` in xapi.conf), which will eventually become the default. The code prior to the above changes used 0 delay between checking for events, so do the same here (although this leads to a lot of inefficient wakeups of all active tasks in XAPI whenever anything changes, it matches the previous behaviour). Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…l used internally (#6222) This fixes a performance regression on `feature/perf` with all feature flags disabled. By default Event.next is still used internally, so although this API is deprecated do not yet enable the throttling by default. Fixes: 3e1d8a2 ("CP-51692: Event.next: use same batching as Event.from") Fixes: 2b4e0db ("CP-49158: [prep] Event.{from,next}: make delays configurable and prepare for task specific delays") It slows down all synchronous API calls that create tasks, like VM.start. Only enable the throttling when Event.next is not used internally (`use-event-next = false` in xapi.conf), which will eventually become the default. The code prior to the above changes used 0 delay between checking for events, so do the same here (although this leads to a lot of inefficient wakeups of all active tasks in XAPI whenever anything changes, it matches the previous behaviour). (The code for Event.from used a 50ms delay, which matches the default for that setting already.)
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
The conflict resolution can be reviewed by `git log --remerge-diff -1`: ```diff diff --git a/ocaml/xapi-storage-script/main.ml b/ocaml/xapi-storage-script/main.ml remerge CONFLICT (content): Merge conflict in ocaml/xapi-storage-script/main.ml index 2dea78f513..ae725b9 100644 --- a/ocaml/xapi-storage-script/main.ml +++ b/ocaml/xapi-storage-script/main.ml @@ -1502,17 +1502,11 @@ let bind ~volume_script_dir = |> wrap in S.VDI.attach3 vdi_attach3_impl ; -<<<<<<< 6ccaf7b (Update feature/perf with latest blocker fixes (#6237)) - let vdi_activate_common dbg dp sr vdi' vm readonly = -||||||| 1e5114c - let vdi_activate_common dbg dp sr vdi' vm' readonly = -======= let dp_attach_info_impl dbg sr vdi dp vm = vdi_attach3_impl dbg dp sr vdi vm () in S.DP.attach_info dp_attach_info_impl ; - let vdi_activate_common dbg dp sr vdi' vm' readonly = ->>>>>>> 6949dbd (Typo. Only throw assertions at Debug time. (#6262)) + let vdi_activate_common dbg dp sr vdi' vm readonly = (let vdi = Storage_interface.Vdi.string_of vdi' in let domain = domain_of ~dp ~vm in Attached_SRs.find sr >>>= fun sr -> @@ -1817,7 +1811,7 @@ let bind ~volume_script_dir = stat ~dbg ~sr ~vdi:temporary ) >>>= fun response -> - choose_datapath domain response >>>= fun (rpc, datapath, uri, domain) -> + choose_datapath response >>>= fun (rpc, datapath, uri) -> if Datapath_plugins.supports_feature datapath _vdi_mirror_in then return_data_rpc (fun () -> Datapath_client.import_activate (rpc ~dbg) dbg uri domain diff --git a/ocaml/xapi/xapi_globs.ml b/ocaml/xapi/xapi_globs.ml remerge CONFLICT (content): Merge conflict in ocaml/xapi/xapi_globs.ml index 3c96b1792a..c07c3d9 100644 --- a/ocaml/xapi/xapi_globs.ml +++ b/ocaml/xapi/xapi_globs.ml @@ -1692,7 +1692,6 @@ let other_options = diff --git a/ocaml/xapi/xapi_globs.ml b/ocaml/xapi/xapi_globs.ml remerge CONFLICT (content): Merge conflict in ocaml/xapi/xapi_globs.ml index 3c96b1792a..c07c3d9 100644 --- a/ocaml/xapi/xapi_globs.ml +++ b/ocaml/xapi/xapi_globs.ml @@ -1692,7 +1692,6 @@ let other_options = , (fun () -> string_of_bool !disable_webserver) , "Disable the host webserver" ) -<<<<<<< 6ccaf7b (Update feature/perf with latest blocker fixes (#6237)) ; ( "tgroups-enabled" , Arg.Set tgroups_enabled , (fun () -> string_of_bool !tgroups_enabled) @@ -1701,14 +1700,11 @@ let other_options = ; event_from_entry ; event_from_task_entry ; event_next_entry -||||||| 1e5114c -======= ; ( "drivertool" , Arg.Set_string driver_tool , (fun () -> !driver_tool) , "Path to drivertool for selecting host driver variants" ) ->>>>>>> 6949dbd (Typo. Only throw assertions at Debug time. (#6262)) ] (* The options can be set with the variable xapiflags in /etc/sysconfig/xapi. ```
Testing found no bugs unique to this feature branch (the bugs found are pre-existing bugs already affecting master).
last-genius approved these changes Jan 31, 2025
robhoes approved these changes Jan 31, 2025
contificate approved these changes Jan 31, 2025
Relevant feature flags are off by default or match previous values.
Draft PR to check conflicts, waiting for testing to complete.