Description
I've implemented the feature related to decoupling from #37. The main commit can be viewed here. Below is the commit message, for convenience:
Change `SharedTensor::read()` signature from

fn read(&self, device: &DeviceType) -> Result<&MemoryType, ...>

into

fn read<D: IDevice>(&self, device: &D) -> Result<&D::M, ...>
The new signature provides a type-level guarantee that if a Cuda device is passed into `read()`, then it will return Cuda memory (and not Native or OpenCL). The additional unwraps that were previously required (`.as_native().unwrap()`) are no longer needed; the code is clearer and more concise.
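For illustration, a call site changes roughly like this; the variable names (`tensor`, `native_device`, `device_type`) are placeholders I've chosen, not code from the actual patch:

```rust
// Before: read() returned a type-erased &MemoryType, so getting at the
// native memory needed a runtime downcast that could panic.
let mem = tensor.read(&device_type)?;       // &MemoryType
let native = mem.as_native().unwrap();      // runtime check

// After: the device type selects the memory type at compile time,
// so the downcast and unwrap disappear.
let native = tensor.read(&native_device)?;  // &<Native as IDevice>::M
```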
Internally, `SharedTensor` uses the `Any` type to store objects of different types uniformly. Synchronization between memories is also done through a type-erased interface. This makes it possible to define a new Framework in an external crate, or to extract the Cuda and OpenCL frameworks into their own crates, though the error types would require some additional work.
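To sketch the idea, here is a self-contained toy, not the actual collenchyma code; the trait and type names are made up to mirror the shape of `IDevice` and its associated memory type. The tensor stores its per-device copies behind `Box<dyn Any>`, while the typed `read<D>` performs the downcast once, internally:

```rust
use std::any::Any;

/// Toy device trait with an associated memory type (names are assumptions).
trait Device {
    type Memory: Any;
    fn alloc(&self, len: usize) -> Self::Memory;
}

struct Native;
struct NativeMemory(Vec<f32>);

impl Device for Native {
    type Memory = NativeMemory;
    fn alloc(&self, len: usize) -> NativeMemory {
        NativeMemory(vec![0.0; len])
    }
}

/// Toy tensor: copies are stored type-erased, but read() is typed.
struct Tensor {
    copies: Vec<Box<dyn Any>>,
}

impl Tensor {
    fn read<D: Device>(&self, _device: &D) -> Option<&D::Memory> {
        // The downcast happens once, inside the tensor,
        // instead of at every call site.
        self.copies.iter().find_map(|m| m.downcast_ref::<D::Memory>())
    }
}

fn main() {
    let dev = Native;
    let t = Tensor { copies: vec![Box::new(dev.alloc(4))] };
    // The return type is tied to the device type: &NativeMemory here.
    let mem: &NativeMemory = t.read(&dev).unwrap();
    println!("len = {}", mem.0.len());
}
```

Because the tensor only needs `Any` for storage, a device type defined in a completely separate crate can implement the trait and participate without the tensor having to enumerate every framework.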
Use of "dynamic typing" has drawbacks -- mainly slightly larger runtime
overhead. Before this patch benchmarks showed that SharedTensor::read()
takes
19-22ns, now it takes 23-26ns. For comparison, minimal synchronized CUDA
operation will take about 10-40us. Small NN layers on CPU are much faster,
e.g. 10-input softmax layer takes about 500ns. Still, in typical NNs overhead
looks negligible, and I think it's fair tradeoff for code clarity and better
decoupling.
Here are the actual benchmarks, before:
test bench_shared_tensor_access_time_first ... bench: 19 ns/iter (+/- 2)
test bench_shared_tensor_access_time_second ... bench: 21 ns/iter (+/- 0)
after:
test bench_shared_tensor_access_time_first ... bench: 23 ns/iter (+/- 0)
test bench_shared_tensor_access_time_second ... bench: 26 ns/iter (+/- 3)
What's your opinion on it?