This repository was archived by the owner on Apr 28, 2023. It is now read-only.
Tensor Comprehensions (TC) is a fully functional C++ library to *automatically* synthesize high-performance machine learning kernels using [Halide](https://github.com/halide/Halide), [ISL](http://isl.gforge.inria.fr/), and NVRTC or LLVM. TC additionally provides basic integration with Caffe2 and pybind11 bindings for use with Python. We provide more details in our paper on [arXiv](https://arxiv.org/abs/1802.04730).
This library is designed to be highly portable, machine-learning-framework agnostic and only requires a simple tensor library with memory allocation, offloading and synchronization capabilities.
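To illustrate what a tensor comprehension expresses, consider matrix multiplication written in the TC language as in the arXiv paper: `C(m, n) +=! A(m, k) * B(k, n)`. Indices that appear only on the right-hand side (here `k`) are reduced over, and `+=!` zero-initializes the accumulator before the reduction. The following is a plain-NumPy sketch of those semantics, not the TC API itself:

```python
import numpy as np

# TC comprehension being illustrated (syntax per the TC paper):
#   def matmul(float(M,K) A, float(K,N) B) -> (C) {
#       C(m, n) +=! A(m, k) * B(k, n)
#   }

def matmul_comprehension(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    # The "!" in "+=!" means C is zero-initialized before accumulation.
    C = np.zeros((M, N), dtype=A.dtype)
    for m in range(M):
        for n in range(N):
            for k in range(K):  # k appears only on the RHS: reduction index
                C[m, n] += A[m, k] * B[k, n]
    return C

A = np.arange(6, dtype=np.float64).reshape(2, 3)
B = np.arange(12, dtype=np.float64).reshape(3, 4)
assert np.allclose(matmul_comprehension(A, B), A @ B)
```

TC's compiler maps such comprehensions onto optimized GPU (or LLVM) code rather than executing the naive loop nest above.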
The following illustrates a short but powerful feature of the library: the capacity …
```cpp
#include <ATen/ATen.h>

#include "tc/aten/aten_compiler.h"
#include "tc/core/mapping_options.h"

// 1. Define and setup the TC compilation unit with CUDA memory management backed by ATen.
// …
```
After a few generations of autotuning on a 2-GPU P100 system, we see results resembling …

We have not yet characterized the precise fraction of peak performance we obtain, but it is not uncommon to reach 80%+ of peak shared-memory bandwidth after autotuning. Solid register-level optimizations are still in the works, but TC in its current form already addresses the productivity gap between the needs of research and the needs of production, which is why we are excited to share it with the entire community and bring this collaborative effort into the open.
# Installation / Documentation
You can find the documentation [here](https://facebookresearch.github.io/TensorComprehensions/), which contains instructions for building TC via Docker, via conda packages, or in a non-conda environment.
* **Slack**: For discussion around framework integration, build support, collaboration, etc., join our Slack channel at https://tensorcomprehensions.slack.com. You may need an invitation to join; contact us by email at tensorcomp@fb.com to get one.