diff --git a/MANIFEST.in b/MANIFEST.in index 549cc6983c..7ebb4241f4 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -5,3 +5,5 @@ graft contrib recursive-include hls4ml/templates * global-exclude .git .gitmodules .gitlab-ci.yml include hls4ml/backends/vivado_accelerator/supported_boards.json +include hls4ml/backends/vitis_accelerator/supported_boards.json +include hls4ml/backends/vitis_accelerator/vivado_directives.json diff --git a/docs/backend/accelerator.rst b/docs/backend/accelerator.rst index 187bccaa2c..d9cb4e31b1 100644 --- a/docs/backend/accelerator.rst +++ b/docs/backend/accelerator.rst @@ -75,3 +75,112 @@ The ``predict`` method will send the input data to the PL and return the output nn = NeuralNetworkOverlay('hls4ml_nn.bit', X_test.shape, y_test.shape) y_hw, latency, throughput = nn.predict(X_test, profile=True) + +================ +VitisAccelerator +================ + +The ``VitisAccelerator`` backend leverages the `Vitis System Design Flow `_ to automate and simplify the creation of an hls4ml project targeting `AMD Alveo PCIe accelerators `_. +The Vitis accelerator backend has been tested with the following boards: + +* `Alveo u50 `_ +* `Alveo u55c `_ +* `Alveo u250 `_ +* `Versal vck5000 `_ + +Kernel wrapper +============== + +To integrate with the Vitis System Design Flow and run on an accelerator, the generated ``hls4ml`` model must be encapsulated and built as a Vitis kernel (``.xo`` file) and linked into a binary file (``.xclbin``) during the implementation step. On the host side, standard C++ code using either `OpenCL `_ or `XRT API `_ can be used to download the ``.xclbin`` file to the accelerator card and use any kernel it contains. + +The ``VitisAccelerator`` backend automatically generates a kernel wrapper, a host code example, and a Makefile to build the project. + +**Note:** The current implementation of the kernel wrapper code is oriented toward throughput benchmarking and not general inference uses (See :ref:`here`). 
It can nonetheless be further customized to fit specific applications. + +Options +======= + +As PCIe accelerators are not suitable for ultra-low latency applications, it is assumed that they are used for high-throughput applications. To accommodate this, the backend supports the following options to optimize the kernel for throughput: + + * ``num_kernel``: Number of kernel instances to implement in the hardware architecture. + * ``num_worker``: Number of host threads used to exercise the kernels in the host application. + * ``batchsize``: Number of samples to be processed in a single kernel execution. + +Additionally, the backend proposes the following options to customize the implementation: + + * ``board``: The target board, must match one entry in ``supported_boards.json``. + * ``clock_period``: The target clock period in ns. + * ``hw_quant``: Is arbitrary precision quantization performed in hardware or not. If True, the quantization is performed in hardware and floats are used at the kernel interface, otherwise it is performed in software and arbitrary precision types are used at the interface. (Defaults to ``False``). + * ``vivado_directives``: A list of strings to be added under the ``[Vivado]`` section of the generated ``accelerator_card.cfg`` link configuration file. Can be used to add custom directives to the Vivado project. + +Build workflow +============== + +At the call of the ``build`` method, the following options affect the build process: + + * ``reset``: If True, clears files generated during previous build processes (Equivalent to ``make clean`` in build folder). + * ``target``: Can be one of ``hw``, ``hw_emu``, ``sw_emu``, to define which build target to use (Default is ``hw``). + * ``debug``: If True, compiles the C++ host code and the HLS in debug mode. 
+ +Once the project is generated, it is possible to manually run the build steps by using one of the following ``make`` targets in the generated project directory: + + * ``host``: Compiles the host application. + * ``hls``: Produces only the kernel's object file. + * ``xclbin``: Produces only the kernel's .xclbin file. + * ``clean``: Removes all generated files. + * ``run``: Run the host application using the .xclbin file and the input data present in ``tb_data/tb_input_features.dat``. + +It is also possible to run the full build process by calling ``make`` without any target. Modifications to the ``accelerator_card.cfg`` file can be done manually before running the build process (e.g., to change the clock period, or add additional ``.xo`` kernels to the build). + +Host code +========= + +Once built, the host program can be run to load the board and perform inferences: + +.. code-block:: Bash + + ./host + +By default, all Computing Units (CUs) on all compatible devices will be used, with 3 worker threads per CU. + +The generated host code application supports the following options to tweak the execution: + + * ``-d``: device BDF to use (can be specified multiple times) + * ``-x``: XCLBIN path + * ``-i``: input feature file + * ``-o``: output feature file + * ``-c``: maximum computing units count to use + * ``-n``: number of worker threads to use + * ``-r``: number of repetitions of the input feature file (For artificially increasing the data size for benchmarking purposes) + * ``-v``: enable verbose output + * ``-h``: print help + +The following example shows how to limit execution to only one device, one CU, and one worker thread: + +.. code-block:: Bash + + ./host -d 0000:c1:00.1 -c 1 -n 1 + +Example +======= + +The following example is a modified version of `hls4ml example 7 `_. + +.. 
code-block:: Python + + import hls4ml + hls_model = hls4ml.converters.convert_from_keras_model( + model, + hls_config=config, + output_dir='model_3/hls4ml_prj_vitis_accel', + backend='VitisAccelerator', + board='alveo-u55c', + num_kernel=4, + num_worker=8, + batchsize=8192, + hw_quant=False, + vivado_directives=["prop=run.impl_1.STEPS.PLACE_DESIGN.ARGS.DIRECTIVE=Explore"] + ) + hls_model.compile() + hls_model.build() + y = hls_model.hardware_predict(y) # Limited to batchsize * num_kernel * num_worker for now diff --git a/docs/ir/modelgraph.rst b/docs/ir/modelgraph.rst index 048e67e101..cfe6f6f335 100644 --- a/docs/ir/modelgraph.rst +++ b/docs/ir/modelgraph.rst @@ -102,3 +102,24 @@ The trace method is an advanced version of the ``predict`` method. It's used to #We also support a similar function for keras keras_trace = hls4ml.model.profiling.get_ymodel_keras(keras_model, X) + +---- + +.. _hardware_predict-method: + +``hardware_predict`` method +=========================== + +A specialized version of the ``predict`` method, for the VitisAccelerator backend after a successful build. Runs the project on the FPGA and obtains prediction for the supplied numpy array. + +**Note:** The host code being run under the hood is an example written for generic benchmarking purposes, helpful for validating projects and gauging maximum throughput. It should be further adapted for more specific applications. Currently, the maximum number of input samples that can be processed is ``batchsize * num_cu * num_buffer``. If the input array exceeds that size, the additional samples will be ignored. + +An optional ``target`` argument can be used to specify the target emulation mode (``hw``, ``sw_emu``, ``hw_emu``) to run the project on. The default is ``hw``. + +.. 
code-block:: python + + # Suppose that you already have input array X + # Note that you have to do both hls_model.compile() and hls_model.build(), ensuring the + # .xclbin file is successfully created, before using hardware_predict + + y = hls_model.hardware_predict(X) diff --git a/hls4ml/backends/__init__.py b/hls4ml/backends/__init__.py index 4a48f072cd..66b69626fc 100644 --- a/hls4ml/backends/__init__.py +++ b/hls4ml/backends/__init__.py @@ -3,6 +3,7 @@ from hls4ml.backends.oneapi.oneapi_backend import OneAPIBackend from hls4ml.backends.quartus.quartus_backend import QuartusBackend from hls4ml.backends.symbolic.symbolic_backend import SymbolicExpressionBackend +from hls4ml.backends.vitis_accelerator.vitis_accelerator_config import VitisAcceleratorConfig # noqa: F401 from hls4ml.backends.vivado.vivado_backend import VivadoBackend from hls4ml.backends.vivado_accelerator.vivado_accelerator_backend import VivadoAcceleratorBackend from hls4ml.backends.vivado_accelerator.vivado_accelerator_config import VivadoAcceleratorConfig # noqa: F401 @@ -10,10 +11,13 @@ from hls4ml.backends.catapult.catapult_backend import CatapultBackend # isort: skip from hls4ml.backends.vitis.vitis_backend import VitisBackend # isort: skip +from hls4ml.backends.vitis_accelerator.vitis_accelerator_backend import VitisAcceleratorBackend # isort: skip + register_backend('Vivado', VivadoBackend) register_backend('VivadoAccelerator', VivadoAcceleratorBackend) register_backend('Vitis', VitisBackend) +register_backend('VitisAccelerator', VitisAcceleratorBackend) register_backend('Quartus', QuartusBackend) register_backend('Catapult', CatapultBackend) register_backend('SymbolicExpression', SymbolicExpressionBackend) diff --git a/hls4ml/backends/vitis_accelerator/__init__.py b/hls4ml/backends/vitis_accelerator/__init__.py new file mode 100644 index 0000000000..e69de29bb2 diff --git a/hls4ml/backends/vitis_accelerator/passes/__init__.py b/hls4ml/backends/vitis_accelerator/passes/__init__.py new file 
mode 100644 index 0000000000..e69de29bb2 diff --git a/hls4ml/backends/vitis_accelerator/passes/feature_check.py b/hls4ml/backends/vitis_accelerator/passes/feature_check.py new file mode 100644 index 0000000000..d7f9c2a7f5 --- /dev/null +++ b/hls4ml/backends/vitis_accelerator/passes/feature_check.py @@ -0,0 +1,34 @@ +from hls4ml.model.optimizer import OptimizerPass + + +class ValidateConvImplementation(OptimizerPass): + def match(self, node): + return 'Conv' in node.class_name + + def transform(self, model, node): + if node.get_attr('implementation', 'linebuffer') == 'encoded': + print( + f'WARNING: "Encoded" implementation in "{node.name}" ({node.class_name}) is not supported in Vitis backend. ' + 'Switching to "LineBuffer" implementation.' + ) + node.set_attr('implementation', 'linebuffer') + + +class ValidateStrategy(OptimizerPass): + _resource_layer_cls = ['Conv1D', 'Conv2D', 'Dense'] + + def match(self, node): + is_resource_layer = len([layer_cls for layer_cls in self._resource_layer_cls if layer_cls in node.class_name]) > 0 + is_resource_strategy = node.model.config.is_resource_strategy(node) + + return is_resource_layer and is_resource_strategy + + def transform(self, model, node): + n_in, _ = model.config.backend.get_layer_mult_size(node) + rf = node.get_attr('reuse_factor') + if rf > n_in and rf % n_in > 0: + print( + f'WARNING: "Resource" strategy in "{node.name}" ({node.class_name}) may have suboptimal QoR in Vitis ' + 'backend due to use of "urem" cores.\n' + 'Consider using a different ReuseFactor or switching to "Latency" strategy.' 
+ ) diff --git a/hls4ml/backends/vitis_accelerator/supported_boards.json b/hls4ml/backends/vitis_accelerator/supported_boards.json new file mode 100644 index 0000000000..efa6c2894b --- /dev/null +++ b/hls4ml/backends/vitis_accelerator/supported_boards.json @@ -0,0 +1,26 @@ +{ + "alveo-u55c": { + "board_type": "alveo", + "part": "xcu55c-fsvh2892-2L-e", + "platform": ["xilinx_u55c_gen3x16_xdma_3_202210_1"], + "memory": {"type": "hbm", "channels": 32, "capacity": 16} + }, + "alveo-u50": { + "board_type": "alveo", + "part": "xcu50-fsvh2104-2-e", + "platform": ["xilinx_u50_gen3x16_xdma_5_202210_1"], + "memory": {"type": "hbm", "channels": 32, "capacity": 8} + }, + "alveo-u250": { + "board_type": "alveo", + "part": "xcu250-figd2104-2L-e", + "platform": ["xilinx_u250_xdma_201830_2"], + "memory": {"type": "ddr", "channels": 4, "capacity": 64} + }, + "vck5000": { + "board_type": "versal", + "part": "xcvc1902-vsvd1760-2MP-e-S", + "platform": ["xilinx_vck5000_gen4x8_qdma_2_202220_1"], + "memory":{"type": "ddr", "channels": 3, "capacity": 12} + } + } diff --git a/hls4ml/backends/vitis_accelerator/vitis_accelerator_backend.py b/hls4ml/backends/vitis_accelerator/vitis_accelerator_backend.py new file mode 100644 index 0000000000..0ca8496f25 --- /dev/null +++ b/hls4ml/backends/vitis_accelerator/vitis_accelerator_backend.py @@ -0,0 +1,165 @@ +import os +import subprocess +import sys + +import numpy as np + +from hls4ml.backends import VitisBackend, VivadoBackend +from hls4ml.model.flow import get_flow, register_flow + + +class VitisAcceleratorBackend(VitisBackend): + def __init__(self): + super(VivadoBackend, self).__init__(name="VitisAccelerator") + self._register_layer_attributes() + self._register_flows() + + def create_initial_config( + self, + board="alveo-u55c", + platform=None, + part=None, + clock_period=5, + clock_uncertainty='27%', + io_type="io_parallel", + num_kernel=1, + num_worker=1, + batchsize=8192, + hw_quant=False, + vivado_directives=None, + **_, + ): + """ + 
Create initial accelerator config with default parameters + + Args: + board: one of the keys defined in supported_boards.json + clock_period: clock period passed to hls project + io_type: io_parallel or io_stream + num_kernel: how many compute units to create on the fpga + num_worker: how many threads the host cpu uses to drive each CU on the fpga + batchsize: how many samples to process within a single buffer on the fpga + vivado_directives: Directives passed down to Vivado that controls the hardware synthesis and implementation steps + Returns: + populated config + """ + board = board if board is not None else "alveo-u55c" + config = super().create_initial_config(part, clock_period, clock_uncertainty, io_type) + config["AcceleratorConfig"] = {} + config["AcceleratorConfig"]["Board"] = board + config["AcceleratorConfig"]["Platform"] = platform + config["AcceleratorConfig"]["Num_Kernel"] = num_kernel + config["AcceleratorConfig"]["Num_Worker"] = num_worker + config["AcceleratorConfig"]["Batchsize"] = batchsize + config["AcceleratorConfig"]["HW_Quant"] = hw_quant + config["AcceleratorConfig"]["Vivado_Directives"] = vivado_directives + return config + + def build( + self, + model, + reset=False, + target="hw", + debug=False, + **kwargs, + ): + self._validate_target(target) + + if "linux" in sys.platform: + + curr_dir = os.getcwd() + os.chdir(model.config.get_output_dir()) + + command = f"TARGET={target} " + + if debug: + command += "DEBUG=1 " + + command += " make all" + + # Cleaning + if reset: + os.system(f"TARGET={target} make clean") + + # Pre-loading libudev + ldconfig_output = subprocess.check_output(["ldconfig", "-p"]).decode("utf-8") + for line in ldconfig_output.split("\n"): + if "libudev.so" in line and "x86" in line: + command = "LD_PRELOAD=" + line.split("=>")[1].strip() + " " + command + break + os.system(command) + + os.chdir(curr_dir) + else: + raise Exception("Currently untested on non-Linux OS") + + def numpy_to_dat(self, model, x): + if 
len(model.get_input_variables()) != 1: + raise Exception("Currently unsupported for multi-input/output projects") + + # Verify numpy array of correct shape + expected_shape = model.get_input_variables()[0].size() + actual_shape = np.prod(x.shape[1:]) + if expected_shape != actual_shape: + raise Exception(f"Input shape mismatch, got {x.shape}, expected (_, {expected_shape})") + + # Write to tb_data/tb_input_features.dat + samples = x.reshape(x.shape[0], -1) + input_dat = f"{model.config.get_output_dir()}/tb_data/tb_input_features.dat" + np.savetxt(input_dat, samples, fmt="%.4e") + + def dat_to_numpy(self, model): + expected_shape = model.get_output_variables()[0].size() + output_file = f"{model.config.get_output_dir()}/tb_data/hw_results.dat" + y = np.loadtxt(output_file, dtype=float).reshape(-1, expected_shape) + return y + + def hardware_predict(self, model, x, target="hw", debug=False, profilingRepeat=-1): + command = "" + if debug: command = "DEBUG=1 " + if isinstance(profilingRepeat, int) and profilingRepeat > 0: + command += "PROFILING_DATA_REPEAT_COUNT=" + str(profilingRepeat) + " " + self._validate_target(target) + + self.numpy_to_dat(model, x) + + currdir = os.getcwd() + os.chdir(model.config.get_output_dir()) + command += "TARGET=" + target + " make run" + os.system(command) + os.chdir(currdir) + + return self.dat_to_numpy(model) + + def _register_flows(self): + validation_passes = [ + "vitisaccelerator:validate_conv_implementation", + "vitisaccelerator:validate_strategy", + ] + validation_flow = register_flow( + "validation", + validation_passes, + requires=["vivado:init_layers"], + backend=self.name, + ) + + # Any potential templates registered specifically for Vitis backend + template_flow = register_flow( + "apply_templates", + self._get_layer_templates, + requires=["vivado:init_layers"], + backend=self.name, + ) + + writer_passes = ["make_stamp", "vitisaccelerator:write_hls"] + self._writer_flow = register_flow("write", writer_passes, requires=["vitis:ip"], 
backend=self.name) + + ip_flow_requirements = get_flow("vivado:ip").requires.copy() + ip_flow_requirements.insert(ip_flow_requirements.index("vivado:init_layers"), validation_flow) + ip_flow_requirements.insert(ip_flow_requirements.index("vivado:apply_templates"), template_flow) + + self._default_flow = register_flow("ip", None, requires=ip_flow_requirements, backend=self.name) + + def _validate_target(self, target): + if target not in ["hw", "hw_emu", "sw_emu"]: + raise Exception("Invalid target, must be one of 'hw', 'hw_emu' or 'sw_emu'") diff --git a/hls4ml/backends/vitis_accelerator/vitis_accelerator_config.py b/hls4ml/backends/vitis_accelerator/vitis_accelerator_config.py new file mode 100644 index 0000000000..c123a3382b --- /dev/null +++ b/hls4ml/backends/vitis_accelerator/vitis_accelerator_config.py @@ -0,0 +1,75 @@ +import json +import os + + +class VitisAcceleratorConfig: + def __init__(self, config): + self.config = config.config + accel_config = self.config.get("AcceleratorConfig", None) + if accel_config is None: + raise Exception("Missing AcceleratorConfig") + + self.board = accel_config.get("Board", "alveo-u55c") + self.supported_boards = json.load(open(os.path.dirname(__file__) + "/supported_boards.json")) + if self.board in self.supported_boards.keys(): + board_info = self.supported_boards[self.board] + self.board_type = board_info["board_type"] + self.part = board_info["part"] + if accel_config.get("Platform") is not None: + if accel_config.get("Platform") in board_info["platform"]: + self.platform = accel_config.get("Platform") + else: + print( + "WARNING: You set an unrecognized Platform." 
+ "Using " + board_info["platform"][0] + " platform instead" + ) + self.platform = board_info["platform"][0] + else: + print("Using " + board_info["platform"][0] + " platform") + self.platform = board_info["platform"][0] + self.memory_type = board_info["memory"]["type"] + self.memory_channel_count = board_info["memory"]["channels"] + else: + raise Exception("The board does not appear in supported_boards.json file") + + if self.config.get("Part") is not None: + if self.config.get("Part") != self.part: + print( + "WARNING: You set a Part that does not correspond to the Board you specified." + "The correct Part is now set." + ) + self.config["Part"] = self.part + + self.num_kernel = accel_config.get("Num_Kernel", 1) + self.num_worker = accel_config.get("Num_Worker", 1) + self.batchsize = accel_config.get("Batchsize", 8192) + self.hw_quant = accel_config.get("HW_Quant", False) + + self.vivado_directives = accel_config.get("Vivado_Directives", []) + + def get_board_type(self): + return self.board_type + + def get_platform(self): + return self.platform + + def get_num_worker(self): + return self.num_worker + + def get_num_kernel(self): + return self.num_kernel + + def get_batchsize(self): + return self.batchsize + + def get_memory_type(self): + return self.memory_type + + def get_memory_channel_count(self): + return self.memory_channel_count + + def get_hw_quant(self): + return self.hw_quant + + def get_vivado_directives(self): + return self.vivado_directives diff --git a/hls4ml/backends/vitis_accelerator/vivado_directives.json b/hls4ml/backends/vitis_accelerator/vivado_directives.json new file mode 100644 index 0000000000..1f6abafb35 --- /dev/null +++ b/hls4ml/backends/vitis_accelerator/vivado_directives.json @@ -0,0 +1,137 @@ +{ + "impl.strategies": [ + "Performance_Explore", + "Performance_ExplorePostRoutePhysOpt", + "Performance_LBlockPlacement", + "Performance_LBlockPlacementFanoutOpt", + "Performance_NetDelay_high", + "Performance_NetDelay_low", + 
"Performance_Retiming", + "Performance_ExtraTimingOpt", + "Performance_RefinePlacement", + "Performance_SpreadSLL", + "Performance_BalanceSLL", + "Congestion_SpreadLogic_high", + "Congestion_SpreadLogic_medium", + "Congestion_SpreadLogic_low", + "Congestion_SpreadLogic_Explore", + "Congestion_SSI_SpreadLogic_high", + "Congestion_SSI_SpreadLogic_low", + "Area_Explore", + "Area_ExploreSequential", + "Area_ExploreWithRemap", + "Power_DefaultOpt", + "Power_ExploreArea", + "Flow_RunPhysOpt", + "Flow_RunPostRoutePhysOpt", + "Flow_RuntimeOptimized", + "Flow_Quick", + "ALL" + ], + "prop": { + "run": { + "impl": { + "STEPS": { + "OPT_DESIGN": { + "ARGS": { + "DIRECTIVE": [ + "Explore", + "ExploreArea", + "ExploreSequentialArea", + "RuntimeOptimized", + "ExploreWithRemap" + ] + } + }, + "POWER_OPT_DESIGN": { + "IS_ENABLED": [ + "true" + ] + }, + "PLACE_DESIGN": { + "ARGS": { + "DIRECTIVE": [ + "Explore", + "WLDrivenBlockPlacement", + "EarlyBlockPlacement", + "ExtraNetDelay_high", + "ExtraNetDelay_low", + "SSI_SpreadLogic_high", + "SSI_SpreadLogic_low", + "AltSpreadLogic_high", + "AltSpreadLogic_medium", + "AltSpreadLogic_low", + "ExtraPostPlacementOpt", + "ExtraTimingOpt", + "SSI_SpreadSLLs", + "SSI_BalanceSLLs", + "SSI_Balance_SLRs", + "SSI_HighUtilSLRs", + "RuntimeOptimized", + "Quick", + "Auto_1", + "Auto_2", + "Auto_3" + ] + } + }, + "POST_PLACE_POWER_OPT_DESIGN": { + "IS_ENABLED": [ + "true" + ] + }, + "PHYS_OPT_DESIGN": { + "IS_ENABLED": [ + "true" + ], + "ARGS": { + "DIRECTIVE": [ + "Explore", + "ExploreWithHoldFix", + "ExploreWithAggressiveHoldFix", + "AggressiveExplore", + "AlternateReplication", + "AggressiveFanoutOpt", + "AddRetime", + "AlternateFlowWithRetiming", + "RuntimeOptimized" + ] + } + }, + "ROUTE_DESIGN": { + "ARGS": { + "DIRECTIVE": [ + "Explore", + "AggressiveExplore", + "NoTimingRelaxation", + "MoreGlobalIterations", + "HigherDelayCost", + "RuntimeOptimized", + "AlternateCLBRouting", + "Quick" + ] + } + }, + "POST_ROUTE_PHYS_OPT_DESIGN": { + 
"IS_ENABLED": [ + "true" + ], + "ARGS": { + "DIRECTIVE": [ + "Explore", + "ExploreWithHoldFix", + "ExploreWithAggressiveHoldFix", + "AggressiveExplore", + "AlternateReplication", + "AggressiveFanoutOpt", + "AddRetime", + "AlternateFlowWithRetiming", + "RuntimeOptimized" + ] + } + } + } + } + } + } +} diff --git a/hls4ml/model/graph.py b/hls4ml/model/graph.py index 520f96ba5f..6e594e491a 100644 --- a/hls4ml/model/graph.py +++ b/hls4ml/model/graph.py @@ -884,6 +884,14 @@ class TraceData(ctypes.Structure): else: return output, trace_output + def hardware_predict(self, x, **kwargs): + """Currently only supported for VitisAccelerator backend""" + backend = self.config.config.get('Backend', 'Vivado') + if backend != 'VitisAccelerator': + raise Exception(f"Function unsupported for {backend} backend") + + return self.config.backend.hardware_predict(self, x, **kwargs) + def build(self, **kwargs): """Builds the generated project using HLS compiler. diff --git a/hls4ml/templates/vitis_accelerator/Makefile b/hls4ml/templates/vitis_accelerator/Makefile new file mode 100644 index 0000000000..e86d944011 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/Makefile @@ -0,0 +1,150 @@ +# Check environment ########################################################### + +ifndef XILINX_VITIS +$(error XILINX_VITIS variable is not set, please set correctly and rerun) +endif + +ifndef XILINX_XRT +$(error XILINX_XRT variable is not set, please set correctly and rerun) +endif + +ifndef XILINX_VIVADO +$(error XILINX_VIVADO variable is not set, please set correctly and rerun) +endif + +ifneq ($(shell expr $(shell g++ -dumpversion) \>= 5), 1) +CXX := $(XILINX_VIVADO)/tps/lnx64/gcc-6.2.0/bin/g++ +$(warning [WARNING]: g++ version older. 
Using g++ provided by the tool: $(CXX)) +endif + +# Configuration variables ##################################################### + +# Absolute path to top directory of accelerator project +PWD := $(shell pwd) + +# Target (hw, hw_emu, sw_emu) +TARGET ?= hw + +# Accelerator card configuration file +CARD_CFG ?= accelerator_card.cfg + +# Platform (currently extracted from accelerator_card.cfg if not already set) +PLATFORM ?= $(shell awk -F '=' '/platform=/ {print $$2}' $(CARD_CFG)) + +# Board Type (determines whether design will go through packaging step) +BOARD_TYPE := #BOARDTYPE + +# Kernel name +KERNEL_NAME := #PRJNAME + +# Wrapper name +WRAPPER_NAME := kernel_wrapper + +# Top level build directory +BUILD_DIR := ./build_$(TARGET) +ifdef DEBUG +BUILD_DIR := $(BUILD_DIR)_deb +else +BUILD_DIR := $(BUILD_DIR)_rel +endif + +# Directories for kernel synthesis +XO_DIR := $(BUILD_DIR)/xo +XCLBIN_DIR := $(BUILD_DIR)/xclbin + +# CC flags for v++ +XOCCFLAGS := -t $(TARGET) --config $(CARD_CFG) --messageDb=$(BUILD_DIR)/kernel_wrapper.mdb + +# Linker flags for V++ +XOLDFLAGS := -t $(TARGET) --config $(CARD_CFG) --messageDb=$(BUILD_DIR)/kernel_wrapper.mdb + +# C++ compiler & linker flags +CXXFLAGS := -Wall -std=c++11 -Wno-unknown-pragmas +LDFLAGS = -L$(XILINX_XRT)/lib/ -lstdc++ -lpthread -lrt -lOpenCL + +ifdef DEBUG +XOCCFLAGS += -g +XOLDFLAGS += -g +CXXFLAGS += -g -O0 +else +# Optimization flags can be added here +endif + +.PHONY: all xclbin hls run clean cleanhls cleanxclbin ultraclean + +all: xclbin host + +# Kernel C/RTL synthesis ###################################################### + +HLS_INCLUDES := -I./ -I./firmware/ -I./firmware/weights -I./firmware/nnet_utils/ + +$(BUILD_DIR)/$(KERNEL_NAME)_kernel.xo: $(WRAPPER_NAME).cpp firmware/$(KERNEL_NAME).cpp + mkdir -p $(XO_DIR) + v++ -c $(XOCCFLAGS) --temp_dir $(XO_DIR) --log_dir $(XO_DIR) -o $@ $^ $(HLS_INCLUDES) + +hls: $(BUILD_DIR)/$(KERNEL_NAME)_kernel.xo + +# Kernel linking & packaging 
################################################## + +ifneq (,$(findstring versal,$(BOARD_TYPE))) + +# For Versal architecture, linking and packaging are separate steps +$(BUILD_DIR)/$(WRAPPER_NAME).xsa: $(BUILD_DIR)/$(KERNEL_NAME)_kernel.xo + mkdir -p $(XCLBIN_DIR) + v++ -l $(XOLDFLAGS) --temp_dir $(XCLBIN_DIR) --log_dir $(XCLBIN_DIR) -o $@ $^ + +# VCK5000 specific packaging +XOCCPFLAGS := -t $(TARGET) -f $(PLATFORM) --package.boot_mode=ospi --messageDb=$(BUILD_DIR)/kernel_wrapper.mdb +$(BUILD_DIR)/$(WRAPPER_NAME).xclbin: $(BUILD_DIR)/$(WRAPPER_NAME).xsa + v++ -p $(XOCCPFLAGS) --temp_dir $(XCLBIN_DIR) --log_dir $(XCLBIN_DIR) -o $@ $^ + +else + +# For Standard Alveo, a single step is required for linking and packaging +# This is standard Alveo linking and packaging +$(BUILD_DIR)/$(WRAPPER_NAME).xclbin: $(BUILD_DIR)/$(KERNEL_NAME)_kernel.xo + mkdir -p $(XCLBIN_DIR) + v++ -l $(XOLDFLAGS) --temp_dir $(XCLBIN_DIR) --log_dir $(XCLBIN_DIR) -o $@ $^ + +endif + +xclbin: $(BUILD_DIR)/$(WRAPPER_NAME).xclbin + +# Host compilation ############################################################ + +INCLUDES := -I$(XILINX_XRT)/include/ -I$(XILINX_VIVADO)/include/ -I$(XILINX_HLS)/include/ +INCLUDES += -I$(PWD)/libs/ -I$(PWD)/firmware/ -I$(PWD)/firmware/nnet_utils/ + +host: $(KERNEL_NAME)_host_cl.cpp libs/xcl2.cpp $(wildcard libs/*.hpp) + $(CXX) $(CXXFLAGS) $(KERNEL_NAME)_host_cl.cpp libs/xcl2.cpp -o $@ $(INCLUDES) $(LDFLAGS) + +# Execute program ############################################################# + +run: ./host $(BUILD_DIR)/$(WRAPPER_NAME).xclbin +ifeq ($(TARGET), hw) + @echo "TARGET is hw, not setting XCL_EMULATION_MODE" + $(eval EMULATION_MODE :=) +else + @echo "Setting XCL_EMULATION_MODE to $(TARGET)" + $(eval EMULATION_MODE := XCL_EMULATION_MODE=$(TARGET)) +endif + @cd firmware && $(EMULATION_MODE) ../host ../$(BUILD_DIR)/$(WRAPPER_NAME).xclbin $(PROFILING_DATA_REPEAT_COUNT) + +# Cleanup ##################################################################### + 
+cleanxclbin: + rm -rf host tb_data/hw_results.dat tb_data/tb_input_features.dat + rm -rf *$(WRAPPER_NAME)*.log + rm -rf $(BUILD_DIR)/$(WRAPPER_NAME).xclbin.* $(BUILD_DIR)/$(WRAPPER_NAME).xsa* $(BUILD_DIR)/$(WRAPPER_NAME).ltx $(BUILD_DIR)/$(WRAPPER_NAME).mdb + rm -rf $(XCLBIN_DIR) + +cleanhls: + rm -rf *$(KERNEL_NAME)_kernel*.log + rm -rf $(BUILD_DIR)/$(KERNEL_NAME)_kernel.xo.* + rm -rf $(XO_DIR) + +clean: cleanxclbin cleanhls + +ultraclean: + rm -rf host tb_data/hw_results.dat tb_data/tb_input_features.dat *.log + rm -rf $(BUILD_DIR) diff --git a/hls4ml/templates/vitis_accelerator/accelerator_card.cfg b/hls4ml/templates/vitis_accelerator/accelerator_card.cfg new file mode 100644 index 0000000000..6f832b8f25 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/accelerator_card.cfg @@ -0,0 +1,14 @@ +kernel=kernel_wrapper +platform=MYPLATFORM +save-temps=1 + +[advanced] +prop=kernel.kernel_wrapper.kernel_flags=-std=c++11 + +[hls] +pre_tcl=./hls_config.tcl +# hls-fpga-machine-learning clock control + +# hls-fpga-machine-learning kernel control + +# hls-fpga-machine-learning vivado directives diff --git a/hls4ml/templates/vitis_accelerator/hls_config.tcl b/hls4ml/templates/vitis_accelerator/hls_config.tcl new file mode 100644 index 0000000000..b4c7c5c441 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/hls_config.tcl @@ -0,0 +1,2 @@ +config_interface -m_axi_auto_max_ports=true +config_interface -m_axi_offset slave diff --git a/hls4ml/templates/vitis_accelerator/kernel_wrapper.h b/hls4ml/templates/vitis_accelerator/kernel_wrapper.h new file mode 100644 index 0000000000..3cbb0c81ed --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/kernel_wrapper.h @@ -0,0 +1,10 @@ +#ifndef KERNEL_WRAPPER_H +#define KERNEL_WRAPPER_H + +#include "firmware/defines.h" + +// hls-fpga-machine-learning accelerator parameters + +// hls-fpga-machine-learning accelerator io + +#endif diff --git a/hls4ml/templates/vitis_accelerator/kernel_wrapper_io_parallel.cpp 
b/hls4ml/templates/vitis_accelerator/kernel_wrapper_io_parallel.cpp new file mode 100644 index 0000000000..f1c2853f2c --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/kernel_wrapper_io_parallel.cpp @@ -0,0 +1,47 @@ +#include "firmware/myproject.h" +#include "kernel_wrapper.h" + +static void read_input(const /*IN_INTERFACE_TYPE*/ *in, in_buffer_t (&in_buf)[BATCHSIZE][DATA_SIZE_IN]) { + for (int i = 0; i < BATCHSIZE; i++) { + #pragma HLS PIPELINE + for (int j = 0; j < DATA_SIZE_IN; j++) { + #pragma HLS UNROLL + in_buf[i][j] = /*IN_HW_QUANT*/ in[i * DATA_SIZE_IN + j]; + } + } +} +static void run_inference(in_buffer_t (&in_buf)[BATCHSIZE][DATA_SIZE_IN], + out_buffer_t (&out_buf)[BATCHSIZE][DATA_SIZE_OUT]) { + for (int i = 0; i < BATCHSIZE; i++) { + #pragma HLS DATAFLOW + myproject(in_buf[i], out_buf[i]); + } +} +static void write_result(/*OUT_INTERFACE_TYPE*/ *out, out_buffer_t (&out_buf)[BATCHSIZE][DATA_SIZE_OUT]) { + for (int i = 0; i < BATCHSIZE; i++) { + #pragma HLS PIPELINE + for (int j = 0; j < DATA_SIZE_OUT; j++) { + #pragma HLS UNROLL + out[i * DATA_SIZE_OUT + j] = /*OUT_HW_QUANT*/ out_buf[i][j]; + } + } +} + +extern "C" { +/** + \brief HLS4ML Kernel Implementation + \param in Input Vector + \param out Output Vector +*/ +void kernel_wrapper(const /*IN_INTERFACE_TYPE*/ *in, /*OUT_INTERFACE_TYPE*/ *out) { + in_buffer_t in_buf[BATCHSIZE][DATA_SIZE_IN]; + out_buffer_t out_buf[BATCHSIZE][DATA_SIZE_OUT]; + #pragma HLS ARRAY_RESHAPE variable=in_buf complete dim=2 + #pragma HLS ARRAY_RESHAPE variable=out_buf complete dim=2 + + #pragma HLS DATAFLOW + read_input(in, in_buf); + run_inference(in_buf, out_buf); + write_result(out, out_buf); +} +} diff --git a/hls4ml/templates/vitis_accelerator/kernel_wrapper_io_stream.cpp b/hls4ml/templates/vitis_accelerator/kernel_wrapper_io_stream.cpp new file mode 100644 index 0000000000..320447cc2f --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/kernel_wrapper_io_stream.cpp @@ -0,0 +1,43 @@ +#include "firmware/myproject.h" 
+#include "kernel_wrapper.h" + +static void read_input(const /*IN_INTERFACE_TYPE*/ *in, hls::stream &input, int n) { + for (int i = 0; i < DATA_SIZE_IN; i++) { + #pragma HLS PIPELINE + input_t tmp; + for (int j = 0; j < NNET_ARRAY_DEPTH; j++) { + #pragma HLS UNROLL + tmp[j] = /*IN_HW_QUANT*/ in[(n * DATA_SIZE_IN * NNET_ARRAY_DEPTH) + (i * NNET_ARRAY_DEPTH) + j]; + } + input << tmp; + } +} + +static void write_result(/*OUT_INTERFACE_TYPE*/ *out, hls::stream &output, int n) { + result_t tmp = output.read(); + for (int i = 0; i < DATA_SIZE_OUT; i++) { + #pragma HLS UNROLL + out[(n * DATA_SIZE_OUT) + i] = /*OUT_HW_QUANT*/ tmp[i]; + } +} + +extern "C" { +/** + \brief HLS4ML Kernel Implementation + \param in Input Vector + \param out Output Vector +*/ +void kernel_wrapper(const /*IN_INTERFACE_TYPE*/ *in, /*OUT_INTERFACE_TYPE*/ *out) { + hls::stream input("input"); + hls::stream output("output"); + #pragma HLS STREAM variable=input depth=DATA_SIZE_IN + #pragma HLS STREAM variable=output depth=1 + + for (int n = 0; n < BATCHSIZE; n++) { + #pragma HLS DATAFLOW + read_input(in, input, n); + myproject(input, output); + write_result(out, output, n); + } +} +} diff --git a/hls4ml/templates/vitis_accelerator/libs/DataBatcher.hpp b/hls4ml/templates/vitis_accelerator/libs/DataBatcher.hpp new file mode 100644 index 0000000000..96f15d4a1a --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/libs/DataBatcher.hpp @@ -0,0 +1,255 @@ +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "Types.hpp" + +template class DataBatcher { + public: + /** + * \brief Constructor + * \param batchsize Number of samples + * \param sampleInputSize Flattened length of a single input to the model + * \param sampleOutputSize Flattened length of a single output from the model + * \param numWorkers Total number of workers + * \param profiling If true, the given data will be iterated over multiple times, + * for more accurate throughput 
testing. + * \param profilingDataRepeat Only used if profiling is set to True. Additional number of + * times the given data is iterated over. + */ + DataBatcher(int batchsize, int sampleInputSize, int sampleOutputSize, int numWorkers, int profilingDataRepeat) + : _batchsize(batchsize), _sampleInputSize(sampleInputSize), _sampleOutputSize(sampleOutputSize), + _numWorkers(numWorkers), _profilingDataRepeat(profilingDataRepeat) {} + + /** + * \brief Read in data to a buffer. Allocate space for results. + * \param filename Filename. + * \param s Type of input, currently supports text files used by VitisAccelerator backend, and + * binary files produced by NumPy's toFile() function + */ + void read(const std::string &filename) { + + std::ifstream fin(filename); + if (!fin.is_open()) { + throw std::runtime_error("Error opening file " + filename); + } + + std::cout << "Reading data from: " << filename << std::endl; + + std::string line; + while (std::getline(fin, line)) { + originalSampleCount++; + std::istringstream parser(line); + T val; + while (parser >> val) { + inputData.push_back(val); + } + if (!parser.eof()) { + throw std::runtime_error("Failed to parse value on line " + std::to_string(originalSampleCount)); + } + } + + std::cout << "Read in " << originalSampleCount << " samples (" << inputData.size() << " elements)" << std::endl; + fin.close(); + + // Zero-pad + numBatches = std::ceil(static_cast(originalSampleCount) / _batchsize); + size_t finalSampleCount = numBatches * _batchsize; + if (finalSampleCount > originalSampleCount) { + std::cout << "Padding with " << (finalSampleCount - originalSampleCount) << " empty samples for a total of " + << numBatches << " batches of " << _batchsize << " samples" << std::endl; + inputData.resize(finalSampleCount * _sampleInputSize, (T)0); + } + } + + bool readReference(const std::string &filename) { + + std::ifstream fref(filename); + if (!fref.is_open()) { + return false; + } + + std::cout << "Reading data from: " << 
filename << std::endl; + size_t refSampleCount = 0; + std::string line; + while (std::getline(fref, line)) { + refSampleCount++; + std::istringstream parser(line); + T val; + while (parser >> val) { + refData.push_back(val); + } + if (!parser.eof()) { + throw std::runtime_error("Failed to parse value on line " + std::to_string(refSampleCount)); + } + } + + std::cout << "Read in " << refSampleCount << " reference samples (" << refData.size() << " elements)" << std::endl; + fref.close(); + return true; + } + + void checkResults() { + if (storedEvalResults.size() == 0 || refData.size() == 0) { + throw std::runtime_error("No data to check"); + } + + if (storedEvalResults.size() != refData.size()) { + throw std::runtime_error("Stored results and reference data are not the same size"); + } + size_t error_count = 0; + for (uint64_t i = 0; i < storedEvalResults.size(); i++) { + if (storedEvalResults[i] != refData[i]) { + error_count++; + std::cout << "Mismatch at index " + std::to_string(i) + ": " + std::to_string((float)storedEvalResults[i]) + + " != " + std::to_string((float)refData[i]) + << ", error = " << ((float)storedEvalResults[i] - (float)refData[i]) << std::endl; + } + } + + if (error_count > 0) { + std::cout << "Mismatch count: " << error_count << std::endl; + throw std::runtime_error("Results do not match reference data"); + } else { + std::cout << "Results match reference data" << std::endl; + } + } + + /** + * \brief Allocate space for writing results to. + */ + void createResultBuffers() { + storedEvalResults.resize(numBatches * _batchsize * _sampleOutputSize, (U)0); + + // Allocate space to dump the extra arbitrary data used during profiling + if (isProfilingMode()) { + profilingResultsDump.resize(_numWorkers * _batchsize * _sampleOutputSize, (U)0); + } + } + + /** + * \brief Splits data into batches and distributes batches evenly amongst Workers. + * \param batchedData A vector of containers for each Worker's batches/workload. 
+ * Size must be equal to _numWorkers. + */ + void batch(std::vector>> &batchedData) { + if (inputData.size() == 0 || originalSampleCount == 0) { + throw std::runtime_error("No data to batch"); + } + std::cout << "Original sample count: " << originalSampleCount << std::endl; + std::cout << "Input sample element count: " << _sampleInputSize << std::endl; + std::cout << "Output sample element count: " << _sampleOutputSize << std::endl; + if (storedEvalResults.size() == 0) { + throw std::runtime_error("Create result buffers first"); + } + + batchedData.resize(_numWorkers); + + uint64_t batchIndex = 0; + while (batchIndex < numBatches) { + int worker = batchIndex % _numWorkers; + uint64_t inputLocation = batchIndex * _batchsize * _sampleInputSize; + uint64_t outputLocation = batchIndex * _batchsize * _sampleOutputSize; + + const T *in = &inputData[inputLocation]; + U *out = &storedEvalResults[outputLocation]; + Batch newBatch = {in, out}; + + batchedData[worker].push_back(newBatch); + batchIndex++; + } + + if (isProfilingMode()) { + std::cout << "Creating profiling batches" << std::endl; + profilingBatchCount = numBatches * (_profilingDataRepeat + 1); + std::cout << "Batches: " << numBatches << std::endl; + std::cout << "Profiling batch count: " << profilingBatchCount << std::endl; + std::cout << "Profiling data repeat: " << _profilingDataRepeat << std::endl; + std::cout << "Profiling total data count: " << profilingBatchCount * _batchsize << std::endl; + while (batchIndex < profilingBatchCount) { + int worker = batchIndex % _numWorkers; + uint64_t inputLocation = (batchIndex % numBatches) * _batchsize * _sampleInputSize; + uint64_t outputLocation = worker * _batchsize * _sampleOutputSize; + + const T *in = &inputData[inputLocation]; + U *out = &profilingResultsDump[outputLocation]; + Batch newBatch = {in, out}; + + batchedData[worker].push_back(newBatch); + batchIndex++; + } + } + } + + /** + * \brief Releases resources used when reading from input files. 
Note: Data from those files + * will be cleared and will no longer be accessible. + */ + void closeFile() { + inputData.clear(); + + originalSampleCount = 0; + numBatches = 0; + profilingBatchCount = 0; + } + + void write(const std::string &filename) { + std::cout << "Writing HW results to: " << filename << std::endl; + std::ofstream fout; + fout.open(filename, std::ios::trunc); + + if (fout.is_open()) { + for (uint64_t i = 0; i < originalSampleCount; i++) { + std::stringstream line; + for (int n = 0; n < _sampleOutputSize; n++) { + line << (float)storedEvalResults[(i * _sampleOutputSize) + n] << " "; + } + fout << line.str() << "\n"; + } + fout.close(); + } else { + throw std::runtime_error("Error writing to file " + filename); + } + + storedEvalResults.clear(); + profilingResultsDump.clear(); + } + + uint64_t getSampleCount() { return originalSampleCount; } + + uint64_t getPaddedSampleCount() { return numBatches * _batchsize; } + + uint64_t getProfilingSampleCount() { return profilingBatchCount * _batchsize; } + + bool isProfilingMode() { return _profilingDataRepeat > 0; } + + private: + int _batchsize; + int _sampleInputSize; + int _sampleOutputSize; + int _numWorkers; + int _profilingDataRepeat; + + /// @brief Number of floats read in. (Not including padding). + uint64_t originalSampleCount = 0; + /// @brief Number of batches of data. (After padding). + uint64_t numBatches = 0; + /// @brief Effective number of batches of data being evaluted. + uint64_t profilingBatchCount = 0; + /// @brief Vector with values. + std::vector inputData; + /// @brief Vector with reference values. + std::vector refData; + /// @brief Vector to store evaluation results. + std::vector storedEvalResults; + /// @brief Vector for dumping results from extra arbitrary data used during profiling. 
+ std::vector profilingResultsDump; +}; diff --git a/hls4ml/templates/vitis_accelerator/libs/FpgaObj.hpp b/hls4ml/templates/vitis_accelerator/libs/FpgaObj.hpp new file mode 100644 index 0000000000..df3e1d9de9 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/libs/FpgaObj.hpp @@ -0,0 +1,239 @@ +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "DataBatcher.hpp" +#include "Params.hpp" +#include "Types.hpp" +#include "Worker.hpp" +#include "xcl2.hpp" + +template class FpgaObj { + public: + /** + * \brief Constructor + * \param batchsize Number of samples + * \param sampleInputSize Flattened length of a single input to the model + * \param sampleOutputSize Flattened length of a single output from the model + * \param numCU Number of compute units synthesized on the FPGA + * \param xclbinFilename String containing path of synthesized xclbin + */ + FpgaObj(const Params ¶ms) + : _batchsize(params.batchSize), _sampleInputSize(params.sampleInputSize), _sampleOutputSize(params.sampleOutputSize), + _numCU(params.numCU), _xclbinFilename(params.xclbinFilename) { + + if (params.deviceBDFs.size() == 0) { + // Finds all AMD/Xilinx devices present in system + devices = xcl::get_xil_devices(); + if (devices.size() == 0) { + throw std::runtime_error("No AMD/Xilinx FPGA devices found"); + } + for (auto &device : devices) { + std::string device_bdf; + OCL_CHECK(err, err = device.getInfo(CL_DEVICE_PCIE_BDF, &device_bdf)); + std::cout << "Found device: " << device.getInfo() << " (" << device_bdf << ")" << std::endl; + } + + } else { + // Find devices by BDF + devices.reserve(params.deviceBDFs.size()); + for (auto &bdf : params.deviceBDFs) { + devices.push_back(xcl::find_device_bdf(xcl::get_xil_devices(), bdf)); + std::cout << "Found device: " << devices.back().getInfo() << " (" << bdf << ")" << std::endl; + } + } + + // Ensure that all devices are of the same type + for (auto &device : devices) { + std::string device_name = 
device.getInfo(); + if (_deviceName.empty()) { + _deviceName = device_name; + } else if (_deviceName != device_name) { + throw std::runtime_error( + "All devices must be of the same type, use -d to specify the BDFs of the devices you want to use"); + } + } + + _numDevice = devices.size(); + + // Load xclbin + std::cout << "Loading: " << _xclbinFilename << std::endl; + std::vector fileBuf = xcl::read_binary_file(_xclbinFilename); + cl::Program::Binaries bins; + for (int i = 0; i < _numDevice; i++) { + bins.push_back({fileBuf.data(), fileBuf.size()}); + } + + // Create OpenCL context + OCL_CHECK(err, context = cl::Context(devices, nullptr, nullptr, nullptr, &err)); + + // Create OpenCL program from binary file + OCL_CHECK(err, program = cl::Program(context, devices, bins, nullptr, &err)); + + std::cout << "Device programmed successfully" << std::endl; + + // Create OpenCL program, and command queues for each device + comQueues.resize(_numDevice); + for (int i = 0; i < _numDevice; i++) { + comQueues[i].resize(_numCU); + // Create OpenCL out-of-order command queues (One per compute unit) + for (int j = 0; j < _numCU; j++) { + comQueues[i][j] = cl::CommandQueue(context, devices[i], + CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE); + } + } + } + + /** + * \brief Creates worker objects for each compute unit on each device + * \param workersPerCU Number of worker objects that will drive each compute unit + */ + void createWorkers(int workersPerCU) { + _workersPerCU = workersPerCU; + + // Construct workers + workers.reserve(_numCU * _workersPerCU); + for (int d = 0; d < _numDevice; d++) { + for (int cu = 0; cu < _numCU; cu++) { + for (int w = 0; w < _workersPerCU; w++) { + workers.emplace_back(d, d * (_numCU * _workersPerCU) + cu * _workersPerCU + w, _batchsize, + _sampleInputSize, _sampleOutputSize, comQueues[d][cu]); + } + } + } + + // Initialize workers + for (int d = 0; d < _numDevice; d++) { + for (int cu = 0; cu < _numCU; cu++) { + for (int w = 0; 
w < _workersPerCU; w++) { + workers[d * (_numCU * _workersPerCU) + cu * _workersPerCU + w].initialize(context, program, cu + 1); + } + } + } + } + + /** + * \brief Loads data from a file into batches and distribute evenly amongst Workers. + * \param fin Filename + * \param s Input type. VitisAccelerator Backend currently uses text input. However, + * the code also supports binary input in the format produced by NumPy's toFile(). + * \param profilingDataRepeat Additional number of times the given data is iterated + * over. Profiling is enabled if this is greater than 0. + */ + void loadData(const std::string &fin, int profilingDataRepeat = 0) { + // Set-up containers for each Worker's batches/workload + batchedData.reserve(_numCU * _workersPerCU * _numDevice); + for (int i = 0; i < _numCU * _workersPerCU * _numDevice; i++) { + batchedData.emplace_back(); + } + + // Batch and distribute data + db = new DataBatcher(_batchsize, _sampleInputSize, _sampleOutputSize, _numCU * _workersPerCU * _numDevice, + profilingDataRepeat); + db->read(fin); + db->createResultBuffers(); + db->batch(batchedData); + } + + /** + * \brief Workers evaluate all loaded data. Each worker uses a separate thread. 
+ */ + void evaluateAll() { + // Check that data has been loaded and batched + if (batchedData.size() == 0 || db == nullptr) { + throw std::runtime_error("No data loaded"); + } + + std::cout << "Starting FPGA run" << std::endl; + + auto ts_start = std::chrono::system_clock::now(); + + std::vector accelThreads; + accelThreads.reserve(_numCU * _workersPerCU * _numDevice); + for (int i = 0; i < _numCU * _workersPerCU * _numDevice; i++) { + accelThreads.emplace_back([this, i]() { this->workers[i].evalLoop(this->batchedData[i]); }); + } + for (int i = 0; i < _numCU * _workersPerCU * _numDevice; i++) { + accelThreads[i].join(); + } + + for (auto deviceQueue : comQueues) { + for (auto queue : deviceQueue) { + OCL_CHECK(err, err = queue.finish()); + } + } + + auto ts_end = std::chrono::system_clock::now(); + + uint64_t ns_elapsed = std::chrono::duration_cast(ts_end - ts_start).count(); + if (db->isProfilingMode()) { + double profilingThroughput = 1.0e9 * static_cast(db->getProfilingSampleCount()) / ns_elapsed; + std::cout << "Processed " << db->getProfilingSampleCount() << " samples in " << ns_elapsed / 1000000 << " ms" + << std::endl; + std::cout << "Profiling throughput: " << profilingThroughput << " predictions/second" << std::endl; + } else { + double throughput = 1.0e9 * static_cast(db->getSampleCount()) / ns_elapsed; + double maxThroughput = 1.0e9 * static_cast(db->getPaddedSampleCount()) / ns_elapsed; + std::cout << "Utilized throughput: " << throughput << " predictions/second" << std::endl; + std::cout << "Max possible throughput: " << maxThroughput << " predictions/second" << std::endl; + } + } + + void checkResults(const std::string &ref) { + if (db == nullptr) { + throw std::runtime_error("No data loaded"); + } + if (db->readReference(ref)) { + db->checkResults(); + } else { + std::cout << "No reference file provided, skipping results check" << std::endl; + } + } + + /** + * \brief Writes results, in text format, to provided file. 
Releases resources + * \param fout Filename. If file already exists, it will be overwritten with current results. + */ + void saveResults(const std::string &fout) { + if (db == nullptr) { + throw std::runtime_error("No data loaded"); + } + db->write(fout); + db->closeFile(); + } + + private: + int _batchsize; + int _sampleInputSize; + int _sampleOutputSize; + int _numCU; + int _numDevice; + std::string _xclbinFilename; + std::string _deviceName; + + /// @brief A list of connected AMD/Xilinx devices + std::vector devices; + /// @brief OpenCL Program that each compute unit executes + cl::Program program; + /// @brief OpenCL Device Context + cl::Context context; + /// @brief OpenCL Command Queues for each compute unit + std::vector> comQueues; + /// @brief Error code storage + cl_int err; + + int _workersPerCU = 0; + /// @brief Workers, indexed by (i_CU * _workersPerCU + i_worker) + std::vector> workers; + /// @brief Data Batcher + DataBatcher *db = nullptr; + /// @brief A vector containing each Worker's batches/workload + std::vector>> batchedData; +}; diff --git a/hls4ml/templates/vitis_accelerator/libs/Params.hpp b/hls4ml/templates/vitis_accelerator/libs/Params.hpp new file mode 100644 index 0000000000..b2ddf66c96 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/libs/Params.hpp @@ -0,0 +1,107 @@ +#pragma once + +#include +#include +#include +#include +#include + +#include "../kernel_wrapper.h" +#include "FpgaObj.hpp" + +class Params { + + public: + Params(int argc, char **argv) { + int opt, temp; + while ((opt = getopt(argc, argv, "x:vhr:n:i:o:d:c:")) != EOF) + switch (opt) { + case 'd': + deviceBDFs.push_back(optarg); + break; + case 'x': + xclbinFilename = optarg; + break; + case 'i': + inputFilename = optarg; + break; + case 'o': + outputFilename = optarg; + break; + case 'c': + temp = atoi(optarg); + if (temp > 0 && temp < NUM_CU) + ; + numCU = temp; + break; + case 'n': + temp = atoi(optarg); + if (temp > 0) + numWorker = temp; + break; + case 'r': + 
dataRepeatCount = atoi(optarg); + break; + case 'v': + verbose++; + break; + case 'h': + help(); + exit(0); + default: + std::cout << std::endl; + abort(); + } + + if (verbose > 0) + print(); + } + + void help(void) { + std::cout << "Options:" << std::endl; + std::cout << " -d: device BDF (can be specified multiple times)" << std::endl; + std::cout << " -x: XCLBIN path" << std::endl; + std::cout << " -i: input file" << std::endl; + std::cout << " -o: output file" << std::endl; + std::cout << " -c: maximum computing units count" << std::endl; + std::cout << " -n: maximum workers count" << std::endl; + std::cout << " -r: input data repeat count" << std::endl; + std::cout << " -v: enable verbose output" << std::endl; + std::cout << " -h: this helps message" << std::endl; + } + + void print(void) { + std::cout << "Run parameters:" << std::endl; + std::cout << " xclbinFilename: " << xclbinFilename << std::endl; + std::cout << " batchSize: " << batchSize << std::endl; + std::cout << " sampleInputSize: " << sampleInputSize << std::endl; + std::cout << " sampleOutputSize: " << sampleOutputSize << std::endl; + std::cout << " numCU: " << numCU << std::endl; + std::cout << " inputFilename: " << inputFilename << std::endl; + std::cout << " outputFilename: " << outputFilename << std::endl; + std::cout << " numWorker: " << numWorker << std::endl; + std::cout << " dataRepeatCount: " << dataRepeatCount << std::endl; + } + + // Device + std::vector deviceBDFs; + + // Bitstream + std::string xclbinFilename = "./build_hw_rel/kernel_wrapper.xclbin"; + size_t batchSize = BATCHSIZE; + const size_t sampleInputSize = INSTREAMSIZE; + const size_t sampleOutputSize = OUTSTREAMSIZE; + size_t numCU = NUM_CU; + + // Data paths + std::string inputFilename = "./tb_data/tb_input_features.dat"; + std::string referenceFilename = "tb_data/tb_output_predictions.dat"; + std::string outputFilename = "./tb_data/hw_results.dat"; + + // Workers + int numWorker = NUM_WORKER; + + // Benchmark + int 
dataRepeatCount = -1; + int verbose = 0; +}; diff --git a/hls4ml/templates/vitis_accelerator/libs/Types.hpp b/hls4ml/templates/vitis_accelerator/libs/Types.hpp new file mode 100644 index 0000000000..0ff3bed610 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/libs/Types.hpp @@ -0,0 +1,8 @@ +#pragma once + +#include + +template struct Batch { + const T *dataIn; + U *dataOut; +}; diff --git a/hls4ml/templates/vitis_accelerator/libs/Worker.hpp b/hls4ml/templates/vitis_accelerator/libs/Worker.hpp new file mode 100644 index 0000000000..5174936f24 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/libs/Worker.hpp @@ -0,0 +1,151 @@ +#pragma once + +#include +#include +#include +#include +#include +#include + +#include "Types.hpp" +#include "xcl2.hpp" + +template class Worker { + public: + /** + * \brief Constructor + * \param batchsize Number of samples + * \param sampleInputSize Flattened length of a single input to the model + * \param sampleOutputSize Flattened length of a single output from the model + * \param commandQueue cl::CommandQueue that the worker will enqueue operations to + * \param queueMutex Mutex protecting the CommandQueue (potentially shared with other workers) + */ + Worker(int deviceId, int workerId, int batchsize, int sampleInputSize, int sampleOutputSize, cl::CommandQueue &queue) + : _deviceId(deviceId), _workerId(workerId), _batchsize(batchsize), _sampleInputSize(sampleInputSize), + _sampleOutputSize(sampleOutputSize), _queue(queue), writeEvents(1), executionEvents(1) { + memmap_in.resize(_batchsize * _sampleInputSize, T(0.0f)); + memmap_out.resize(_batchsize * _sampleOutputSize, U(0.0f)); + } + + /** + * \brief Initializes all resources the Worker needs to drive a compute unit. + * \param context cl::Context of the FPGA. + * \param program cl:Program of the FPGA. + * \param computeUnit The number of the physical compute unit this worker will use. 
+ */ + void initialize(cl::Context &context, cl::Program &program, int computeUnit) { + cl_int err; + + // This is AMD's format for specifying the Compute Unit a kernel object uses + std::string krnl_name = "kernel_wrapper:{kernel_wrapper_" + std::to_string(computeUnit) + "}"; + + // Creating Kernel object + OCL_CHECK(err, krnl = cl::Kernel(program, krnl_name.c_str(), &err)); + + // Per AMD documentation we can leave XRT infer the bank location for the buffer: + // " The XRT can obtain the bank location for the buffer if the buffer + // is used for setting the kernel arguments right after the buffer + // creation, i.e. before any enqueue operation on the buffer." + + const size_t vector_in_size_bytes = sizeof(T) * _batchsize * _sampleInputSize; + const size_t vector_out_size_bytes = sizeof(U) * _batchsize * _sampleOutputSize; + + OCL_CHECK(err, input_buffer = cl::Buffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_in_size_bytes, + memmap_in.data(), &err)); + + OCL_CHECK(err, output_buffer = cl::Buffer(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, vector_out_size_bytes, + memmap_out.data(), &err)); + + // Set kernel arguments will effectively affect the memory bank location + OCL_CHECK(err, err = krnl.setArg(0, input_buffer)); + OCL_CHECK(err, err = krnl.setArg(1, output_buffer)); + + // Perform a dummy transfer input batch to FPGA to ensure that allocation time is not counted + // in the evaluation time. Also allows us to query the memory bank location. 
+ int mem_bank_index = -1; + OCL_CHECK(err, err = _queue.enqueueMigrateMemObjects({input_buffer}, + 0, // 0 means from host + NULL, // No dependencies + &writeEvents[0])); + OCL_CHECK(err, err = writeEvents[0].wait()); + OCL_CHECK(err, err = clGetMemObjectInfo(input_buffer.get(), CL_MEM_BANK, sizeof(int), &mem_bank_index, nullptr)); + + std::cout << "Initialized Worker " << _workerId << ", using CU " << computeUnit << " and memory bank " + << mem_bank_index << " on device " << _deviceId << std::endl; + } + + /** + * \brief Evaluates the single batch currently in input_buffer and writes to output_buffer. + */ + void evaluate() { + + cl_int err; + + // Transfer input batch to FPGA + OCL_CHECK(err, err = _queue.enqueueMigrateMemObjects({input_buffer}, + 0, // 0 means from host + NULL, // No dependencies + &writeEvents[0])); + + // Run kernel on the batch + OCL_CHECK(err, err = _queue.enqueueNDRangeKernel(krnl, 0, 1, 1, &writeEvents, &executionEvents[0])); + + // Transfer output batch from FPGA + OCL_CHECK(err, err = _queue.enqueueMigrateMemObjects({output_buffer}, CL_MIGRATE_MEM_OBJECT_HOST, &executionEvents, + &readEvent)); + + // Wait for all operations to complete + OCL_CHECK(err, err = readEvent.wait()); + } + + /** + * \brief Evaluates each batch of data provided via dataTracker. Uses float datatype + * \param dataTracker Vector of input locations to read from and output locations to write to + */ + void evalLoop(std::list> &dataTracker) { + + while (!dataTracker.empty()) { + // Copy inputs into memory-mapped buffer + // FIXME: It there a way to avoid this copy? Could the orignal batch be used directly if aligned? 
+ const T *dataLoc = dataTracker.front().dataIn; + memcpy(&memmap_in[0], dataLoc, _batchsize * _sampleInputSize * sizeof(T)); + + // Evaluate + evaluate(); + + // Copy outputs into persistent results vector + U *resLoc = dataTracker.front().dataOut; + memcpy(resLoc, &memmap_out[0], _batchsize * _sampleOutputSize * sizeof(U)); + dataTracker.pop_front(); + } + } + + private: + int _deviceId; + int _workerId; + int _batchsize; + int _sampleInputSize; + int _sampleOutputSize; + + /// @brief Reference to the OpenCL command queue + const cl::CommandQueue &_queue; + + /// @brief Vector mapped to FPGA input buffer + std::vector> memmap_in; + /// @brief Vector mapped to FPGA output buffer + std::vector> memmap_out; + + /// @brief OpenCL buffer object for input + cl::Buffer input_buffer; + /// @brief OpenCL buffer object for output + cl::Buffer output_buffer; + /// @brief OpenCL kernel object + cl::Kernel krnl; + + /// @brief Vector tracking write events. Required by OpenCL queue functions. + std::vector writeEvents; + /// @brief Vector tracking kernel execution events. Required by OpenCL queue functions. + std::vector executionEvents; + /// @brief Event for signaling output transfer completion + cl::Event readEvent; +}; diff --git a/hls4ml/templates/vitis_accelerator/libs/xcl2.cpp b/hls4ml/templates/vitis_accelerator/libs/xcl2.cpp new file mode 100644 index 0000000000..6e03deb793 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/libs/xcl2.cpp @@ -0,0 +1,174 @@ +/** + * Copyright (C) 2019-2021 Xilinx, Inc + * + * Licensed under the Apache License, Version 2.0 (the "License"). You may + * not use this file except in compliance with the License. A copy of the + * License is located at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
See the + * License for the specific language governing permissions and limitations + * under the License. + */ + +#include "xcl2.hpp" +#include +#include +#include +#include +#include +#if defined(_WINDOWS) +#include +#else +#include +#endif + +namespace xcl { +std::vector get_devices(const std::string &vendor_name) { + size_t i; + cl_int err; + std::vector platforms; + OCL_CHECK(err, err = cl::Platform::get(&platforms)); + cl::Platform platform; + for (i = 0; i < platforms.size(); i++) { + platform = platforms[i]; + OCL_CHECK(err, std::string platformName = platform.getInfo(&err)); + if (!(platformName.compare(vendor_name))) { + break; + } + } + if (i == platforms.size()) { + std::cout << "Error: Failed to find Xilinx platform" << std::endl; + std::cout << "Found the following platforms : " << std::endl; + for (size_t j = 0; j < platforms.size(); j++) { + platform = platforms[j]; + OCL_CHECK(err, std::string platformName = platform.getInfo(&err)); + std::cout << "Platform Name: " << platformName.c_str() << std::endl; + } + exit(EXIT_FAILURE); + } + // Getting ACCELERATOR Devices and selecting 1st such device + std::vector devices; + OCL_CHECK(err, err = platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices)); + return devices; +} + +std::vector get_xil_devices() { return get_devices("Xilinx"); } + +cl::Device find_device_bdf(const std::vector &devices, const std::string &bdf) { + char device_bdf[20]; + cl_int err; + cl::Device device; + int cnt = 0; + for (uint32_t i = 0; i < devices.size(); i++) { + OCL_CHECK(err, err = devices[i].getInfo(CL_DEVICE_PCIE_BDF, &device_bdf)); + if (bdf == device_bdf) { + device = devices[i]; + cnt++; + break; + } + } + if (cnt == 0) { + std::cout << "Invalid device bdf: " << bdf << ". 
Please check and provide valid bdf\n"; + exit(EXIT_FAILURE); + } + return device; +} +cl_device_id find_device_bdf_c(cl_device_id *devices, const std::string &bdf, cl_uint device_count) { + char device_bdf[20]; + cl_int err; + cl_device_id device; + int cnt = 0; + for (uint32_t i = 0; i < device_count; i++) { + err = clGetDeviceInfo(devices[i], CL_DEVICE_PCIE_BDF, sizeof(device_bdf), device_bdf, 0); + if (err != CL_SUCCESS) { + std::cout << "Unable to extract the device BDF details\n"; + exit(EXIT_FAILURE); + } + if (bdf == device_bdf) { + device = devices[i]; + cnt++; + break; + } + } + if (cnt == 0) { + std::cout << "Invalid device bdf. Please check and provide valid bdf\n"; + exit(EXIT_FAILURE); + } + return device; +} +std::vector read_binary_file(const std::string &xclbin_file_name) { + FILE *fp; + if ((fp = fopen(xclbin_file_name.c_str(), "r")) == nullptr) { + printf("ERROR: %s xclbin not available please build\n", xclbin_file_name.c_str()); + exit(EXIT_FAILURE); + } + // Loading XCL Bin into char buffer + std::ifstream bin_file(xclbin_file_name.c_str(), std::ifstream::binary); + bin_file.seekg(0, bin_file.end); + auto nb = bin_file.tellg(); + bin_file.seekg(0, bin_file.beg); + std::vector buf; + buf.resize(nb); + bin_file.read(reinterpret_cast(buf.data()), nb); + return buf; +} + +bool is_emulation() { + bool ret = false; + char *xcl_mode = getenv("XCL_EMULATION_MODE"); + if (xcl_mode != nullptr) { + ret = true; + } + return ret; +} + +bool is_hw_emulation() { + bool ret = false; + char *xcl_mode = getenv("XCL_EMULATION_MODE"); + if ((xcl_mode != nullptr) && !strcmp(xcl_mode, "hw_emu")) { + ret = true; + } + return ret; +} +double round_off(double n) { + double d = n * 100.0; + int i = d + 0.5; + d = i / 100.0; + return d; +} + +std::string convert_size(size_t size) { + static const char *SIZES[] = {"B", "KB", "MB", "GB"}; + uint32_t div = 0; + size_t rem = 0; + + while (size >= 1024 && div < (sizeof SIZES / sizeof *SIZES)) { + rem = (size % 1024); + div++; 
+ size /= 1024; + } + + double size_d = (float)size + (float)rem / 1024.0; + double size_val = round_off(size_d); + + std::stringstream stream; + stream << std::fixed << std::setprecision(2) << size_val; + std::string size_str = stream.str(); + std::string result = size_str + " " + SIZES[div]; + return result; +} + +bool is_xpr_device(const char *device_name) { + const char *output = strstr(device_name, "xpr"); + + if (output == nullptr) { + return false; + } else { + return true; + } +} +}; // namespace xcl diff --git a/hls4ml/templates/vitis_accelerator/libs/xcl2.hpp b/hls4ml/templates/vitis_accelerator/libs/xcl2.hpp new file mode 100644 index 0000000000..17b9feaace --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/libs/xcl2.hpp @@ -0,0 +1,117 @@ +/** + * Copyright (C) 2019-2021 Xilinx, Inc + * + * Licensed under the Apache License, Version 2.0 (the "License"). You may + * not use this file except in compliance with the License. A copy of the + * License is located at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + * License for the specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#define CL_HPP_CL_1_2_DEFAULT_BUILD +#define CL_HPP_TARGET_OPENCL_VERSION 120 +#define CL_HPP_MINIMUM_OPENCL_VERSION 120 +#define CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY 1 +#define CL_USE_DEPRECATED_OPENCL_1_2_APIS + +// OCL_CHECK doesn't work if call has templatized function call +#define OCL_CHECK(error, call) \ + call; \ + if (error != CL_SUCCESS) { \ + printf("%s:%d Error calling " #call ", error code is: %d\n", __FILE__, __LINE__, error); \ + exit(EXIT_FAILURE); \ + } + +#include +#include +#include +#include +// When creating a buffer with user pointer (CL_MEM_USE_HOST_PTR), under the +// hood +// User ptr is used if and only if it is properly aligned (page aligned). When +// not +// aligned, runtime has no choice but to create its own host side buffer that +// backs +// user ptr. This in turn implies that all operations that move data to and from +// device incur an extra memcpy to move data to/from runtime's own host buffer +// from/to user pointer. So it is recommended to use this allocator if user wish +// to +// Create Buffer/Memory Object with CL_MEM_USE_HOST_PTR to align user buffer to +// the +// page boundary. It will ensure that user buffer will be used when user create +// Buffer/Mem Object with CL_MEM_USE_HOST_PTR. 
+template struct aligned_allocator { + using value_type = T; + + aligned_allocator() {} + + aligned_allocator(const aligned_allocator &) {} + + template aligned_allocator(const aligned_allocator &) {} + + T *allocate(std::size_t num) { + void *ptr = nullptr; + +#if defined(_WINDOWS) + { + ptr = _aligned_malloc(num * sizeof(T), 4096); + if (ptr == nullptr) { + std::cout << "Failed to allocate memory" << std::endl; + exit(EXIT_FAILURE); + } + } +#else + { + if (posix_memalign(&ptr, 4096, num * sizeof(T))) + throw std::bad_alloc(); + } +#endif + return reinterpret_cast(ptr); + } + void deallocate(T *p, std::size_t num) { +#if defined(_WINDOWS) + _aligned_free(p); +#else + free(p); +#endif + } +}; + +namespace xcl { +std::vector get_xil_devices(); +std::vector get_devices(const std::string &vendor_name); +cl::Device find_device_bdf(const std::vector &devices, const std::string &bdf); +cl_device_id find_device_bdf_c(cl_device_id *devices, const std::string &bdf, cl_uint dev_count); +std::string convert_size(size_t size); +std::vector read_binary_file(const std::string &xclbin_file_name); +bool is_emulation(); +bool is_hw_emulation(); +bool is_xpr_device(const char *device_name); +class P2P { + public: + static decltype(&xclGetMemObjectFd) getMemObjectFd; + static decltype(&xclGetMemObjectFromFd) getMemObjectFromFd; + static void init(const cl_platform_id &platform) { + void *bar = clGetExtensionFunctionAddressForPlatform(platform, "xclGetMemObjectFd"); + getMemObjectFd = (decltype(&xclGetMemObjectFd))bar; + bar = clGetExtensionFunctionAddressForPlatform(platform, "xclGetMemObjectFromFd"); + getMemObjectFromFd = (decltype(&xclGetMemObjectFromFd))bar; + } +}; +class Ext { + public: + static decltype(&xclGetComputeUnitInfo) getComputeUnitInfo; + static void init(const cl_platform_id &platform) { + void *bar = clGetExtensionFunctionAddressForPlatform(platform, "xclGetComputeUnitInfo"); + getComputeUnitInfo = (decltype(&xclGetComputeUnitInfo))bar; + } +}; +} // namespace xcl 
diff --git a/hls4ml/templates/vitis_accelerator/myproject_host_cl.cpp b/hls4ml/templates/vitis_accelerator/myproject_host_cl.cpp new file mode 100644 index 0000000000..daad501fa9 --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/myproject_host_cl.cpp @@ -0,0 +1,26 @@ +#include + +#include "FpgaObj.hpp" +#include "Params.hpp" +#include "Types.hpp" +#include "kernel_wrapper.h" +#include "xcl2.hpp" + +int main(int argc, char **argv) { + + Params params(argc, argv); + + FpgaObj fpga(params); + + fpga.createWorkers(params.numWorker); + + fpga.loadData(params.inputFilename, params.dataRepeatCount); + + fpga.evaluateAll(); + + fpga.checkResults(params.referenceFilename); + + fpga.saveResults(params.outputFilename); + + return EXIT_SUCCESS; +} diff --git a/hls4ml/templates/vitis_accelerator/nnet_utils/nnet_types.h b/hls4ml/templates/vitis_accelerator/nnet_utils/nnet_types.h new file mode 100644 index 0000000000..f92cf6887a --- /dev/null +++ b/hls4ml/templates/vitis_accelerator/nnet_utils/nnet_types.h @@ -0,0 +1,64 @@ +#ifndef NNET_TYPES_H_ +#define NNET_TYPES_H_ + +#include +#include +#include + +namespace nnet { + +// Fixed-size array +template struct array { + typedef T value_type; + static const unsigned size = N; + + T data[N]; + + T &operator[](size_t pos) { return data[pos]; } + + const T &operator[](size_t pos) const { return data[pos]; } + + array &operator=(const array &other) { + // if (&other == this) + // return *this; + + assert(N == other.size && "Array sizes must match."); + + for (unsigned i = 0; i < N; i++) { + #pragma HLS UNROLL + data[i] = other[i]; + } + return *this; + } +}; + +// Generic lookup-table implementation, for use in approximations of math functions +template class lookup_table { + public: + lookup_table(T from, T to) : range_start(from), range_end(to), base_div(ap_uint<16>(N) / T(to - from)) { + T step = (range_end - range_start) / ap_uint<16>(N); + for (size_t i = 0; i < N; i++) { + T num = range_start + ap_uint<16>(i) * step; + T 
sample = func(num); + samples[i] = sample; + } + } + + T operator()(T n) const { + int index = (n - range_start) * base_div; + if (index < 0) + index = 0; + else if (index > N - 1) + index = N - 1; + return samples[index]; + } + + private: + T samples[N]; + const T range_start, range_end; + ap_fixed<20, 16> base_div; +}; + +} // namespace nnet + +#endif diff --git a/hls4ml/templates/vivado/build_prj.tcl b/hls4ml/templates/vivado/build_prj.tcl index 05d4b8a4d5..e03674309b 100644 --- a/hls4ml/templates/vivado/build_prj.tcl +++ b/hls4ml/templates/vivado/build_prj.tcl @@ -151,15 +151,26 @@ if {$opt(reset)} { } else { open_project ${project_name}_prj } + set_top ${project_name} add_files firmware/${project_name}.cpp -cflags "-std=c++0x" add_files -tb ${project_name}_test.cpp -cflags "-std=c++0x" add_files -tb firmware/weights add_files -tb tb_data -if {$opt(reset)} { - open_solution -reset "solution1" + +if {[string equal "$backend" "vitisaccelerator"]} { + set flow "vitis" + if {$opt(reset)} { + open_solution -flow_target ${flow} -reset "solution1" + } else { + open_solution -flow_target ${flow} "solution1" + } } else { - open_solution "solution1" + if {$opt(reset)} { + open_solution -reset "solution1" + } else { + open_solution "solution1" + } } catch {config_array_partition -maximum_size $maximum_size} config_compile -name_max_length 80 diff --git a/hls4ml/writer/__init__.py b/hls4ml/writer/__init__.py index 8de19fe1d2..765afdbe6e 100644 --- a/hls4ml/writer/__init__.py +++ b/hls4ml/writer/__init__.py @@ -2,6 +2,7 @@ from hls4ml.writer.oneapi_writer import OneAPIWriter from hls4ml.writer.quartus_writer import QuartusWriter from hls4ml.writer.symbolic_writer import SymbolicExpressionWriter +from hls4ml.writer.vitis_accelerator_writer import VitisAcceleratorWriter from hls4ml.writer.vitis_writer import VitisWriter from hls4ml.writer.vivado_accelerator_writer import VivadoAcceleratorWriter from hls4ml.writer.vivado_writer import VivadoWriter @@ -10,6 +11,7 @@ 
register_writer('Vivado', VivadoWriter) register_writer('VivadoAccelerator', VivadoAcceleratorWriter) register_writer('Vitis', VitisWriter) +register_writer('VitisAccelerator', VitisAcceleratorWriter) register_writer('Quartus', QuartusWriter) register_writer('oneAPI', OneAPIWriter) register_writer('Catapult', CatapultWriter) diff --git a/hls4ml/writer/vitis_accelerator_writer.py b/hls4ml/writer/vitis_accelerator_writer.py new file mode 100644 index 0000000000..bfd385d9b1 --- /dev/null +++ b/hls4ml/writer/vitis_accelerator_writer.py @@ -0,0 +1,330 @@ +import os +from shutil import copy, copytree + +from hls4ml.writer.vitis_writer import VitisWriter + + +class VitisAcceleratorWriter(VitisWriter): + def __init__(self): + + super().__init__() + + def create_accelerator_config(self, model): + from hls4ml.backends import VitisAcceleratorConfig + + self.vitis_accelerator_config = VitisAcceleratorConfig(model.config) + + def write_parameters_overrides(self, model): + """Write the C++ layer config file (parameters.h) + + Args: + model (ModelGraph): the hls4ml model. 
+ """ + filedir = os.path.dirname(os.path.abspath(__file__)) + f = open(os.path.join(filedir, "../templates/vivado/firmware/parameters.h")) + fout = open(f"{model.config.get_output_dir()}/firmware/parameters.h", "w") + + for line in f.readlines(): + if "// hls-fpga-machine-learning insert includes" in line: + newline = line + for include in sorted( + set( + sum( + (layer.get_attr("include_header", []) for layer in model.get_layers()), + [], + ) + ) + ): + newline += '#include "%s"\n' % include + newline += '#include "defines.h"' + + elif "// hls-fpga-machine-learning insert weights" in line: + newline = line + for layer in model.get_layers(): + for w in layer.get_weights(): + if w.storage.lower() != "bram": + newline += f'#include "weights/{w.name}.h"\n' + + elif "// hls-fpga-machine-learning insert layer-config" in line: + newline = line + for layer in model.get_layers(): + config = layer.get_attr("config_cpp", None) + if config: + newline += "// " + layer.name + "\n" + newline += config + "\n" + else: + newline = line + fout.write(newline) + f.close() + fout.close() + + def write_build_script_backend_override(self, model): + # project.tcl + f = open(f"{model.config.get_output_dir()}/project.tcl", "w") + f.write("variable project_name\n") + f.write(f'set project_name "{model.config.get_project_name()}"\n') + f.write("variable backend\n") + f.write('set backend "vitisaccelerator"\n') + f.write("variable part\n") + f.write('set part "{}"\n'.format(model.config.get_config_value("Part"))) + f.write("variable clock_period\n") + f.write("set clock_period {}\n".format(model.config.get_config_value("ClockPeriod"))) + f.write("variable clock_uncertainty\n") + f.write("set clock_uncertainty {}\n".format(model.config.get_config_value("ClockUncertainty", "12.5%"))) + f.close() + + def write_kernel(self, model): + """Write the Python-C++ kernel (kernel_wrapper.cpp & kernel_wrapper.h) + + Args: + model (ModelGraph): the hls4ml model. 
+ """ + filedir = os.path.dirname(os.path.abspath(__file__)) + io_type = model.config.get_config_value("IOType") + + # Writing header file + f_header = open(os.path.join(filedir, "../templates/vitis_accelerator/kernel_wrapper.h")) + fout_header = open(f"{model.config.get_output_dir()}/kernel_wrapper.h", "w") + model_inputs = model.get_input_variables() + model_outputs = model.get_output_variables() + if len(model_inputs) != 1 or len(model_outputs) != 1: + raise Exception("Accelerator currently only supports projects with a single input and a single output variable") + inp = model_inputs[0] + out = model_outputs[0] + for line in f_header.readlines(): + if "// hls-fpga-machine-learning accelerator parameters" in line: + newline = "" + newline += "#define NUM_CU " + format(self.vitis_accelerator_config.get_num_kernel()) + "\n" + newline += "#define NUM_WORKER " + format(self.vitis_accelerator_config.get_num_worker()) + "\n" + newline += "#define NUM_CHANNEL " + if self.vitis_accelerator_config.get_memory_type() == "hbm": + newline += ( + format( + self.vitis_accelerator_config.get_memory_channel_count() + // (2 * self.vitis_accelerator_config.get_num_kernel()) + ) + + "\n" + ) + elif self.vitis_accelerator_config.get_memory_type() == "ddr": + newline += "1\n" + newline += "#define BATCHSIZE " + format(self.vitis_accelerator_config.get_batchsize()) + "\n" + elif "// hls-fpga-machine-learning accelerator io" in line: + newline = "" + if io_type == "io_parallel": + newline += "#define DATA_SIZE_IN " + format(inp.size_cpp()) + "\n" + newline += "#define INSTREAMSIZE DATA_SIZE_IN" + "\n\n" + newline += "#define DATA_SIZE_OUT " + format(out.size_cpp()) + "\n" + newline += "#define OUTSTREAMSIZE DATA_SIZE_OUT" + "\n\n" + newline += "typedef " + format(inp.type.name) + " in_buffer_t;\n" + newline += "typedef " + format(out.type.name) + " out_buffer_t;\n" + elif io_type == "io_stream": + dims, _ = zip(*inp.get_shape()) + dims = list(dims) + nnet_array_depth = dims.pop() + 
dims.append("1") + newline += "#define DATA_SIZE_IN " + " * ".join(dims) + "\n" + newline += "#define NNET_ARRAY_DEPTH " + format(nnet_array_depth) + "\n" + newline += "#define INSTREAMSIZE (DATA_SIZE_IN * NNET_ARRAY_DEPTH)" + "\n\n" + newline += "#define DATA_SIZE_OUT " + format(out.size_cpp()) + "\n" + newline += "#define OUTSTREAMSIZE DATA_SIZE_OUT" + "\n\n" + newline += "typedef " + inp.type.precision.definition_cpp() + " in_buffer_t;\n" + newline += "typedef " + out.type.precision.definition_cpp() + " out_buffer_t;\n" + else: + newline = line + fout_header.write(newline) + f_header.close() + fout_header.close() + + # Writing source file + f_source = open( + os.path.join( + filedir, + "../templates/vitis_accelerator/kernel_wrapper_" + io_type + ".cpp", + ) + ) + fout_source = open(f"{model.config.get_output_dir()}/kernel_wrapper.cpp", "w") + isHwQuant = self.vitis_accelerator_config.get_hw_quant() + for line in f_source.readlines(): + if "myproject" in line: + newline = line.replace("myproject", format(model.config.get_project_name())) + elif "/*IN_HW_QUANT*/ " in line: + newline = line.replace("/*IN_HW_QUANT*/ ", "(in_buffer_t)" if isHwQuant else "") + elif "/*OUT_HW_QUANT*/ " in line: + newline = line.replace("/*OUT_HW_QUANT*/ ", "(float)" if isHwQuant else "") + else: + newline = line + + if "/*IN_INTERFACE_TYPE*/" in newline: + newline = newline.replace("/*IN_INTERFACE_TYPE*/", ("float" if isHwQuant else "in_buffer_t")) + if "/*OUT_INTERFACE_TYPE*/" in newline: + newline = newline.replace("/*OUT_INTERFACE_TYPE*/", ("float" if isHwQuant else "out_buffer_t")) + + fout_source.write(newline) + f_source.close() + fout_source.close() + + def write_host(self, model): + """Write the OpenCL-based host code (myproject_host_cl.cpp) and associated libraries + + Args: + model (ModelGraph): the hls4ml model. 
+ """ + # Write host code + filedir = os.path.dirname(os.path.abspath(__file__)) + f = open(os.path.join(filedir, "../templates/vitis_accelerator/myproject_host_cl.cpp")) + fout = open( + f"{model.config.get_output_dir()}/{model.config.get_project_name()}_host_cl.cpp", + "w", + ) + memoryType = self.vitis_accelerator_config.get_memory_type() + isHwQuant = self.vitis_accelerator_config.get_hw_quant() + for line in f.readlines(): + if "/*FPGA_Type*/" in line: + newline = line.replace("/*FPGA_Type*/", memoryType.upper()) + elif "/*INTERFACE_TYPES*/" in line: + dataTypes = "float, float" if isHwQuant else "in_buffer_t, out_buffer_t" + newline = line.replace("/*INTERFACE_TYPES*/", dataTypes) + else: + newline = line + fout.write(newline) + f.close() + fout.close() + + # Write libraries + src = os.path.join(filedir, "../templates/vitis_accelerator/libs") + dst = f"{model.config.get_output_dir()}/libs" + copytree(src, dst, copy_function=copy, dirs_exist_ok=True) + + def write_makefile(self, model): + """Write the Python-C++ Makefile (Makefile) + + Args: + model (ModelGraph): the hls4ml model. + """ + filedir = os.path.dirname(os.path.abspath(__file__)) + f = open(os.path.join(filedir, "../templates/vitis_accelerator/Makefile")) + fout = open(f"{model.config.get_output_dir()}/Makefile", "w") + + board_type = self.vitis_accelerator_config.get_board_type() + project_name = format(model.config.get_project_name()) + for line in f.readlines(): + if "#PRJNAME" in line: + newline = line.replace("#PRJNAME", project_name) + elif "#BOARDTYPE" in line: + newline = line.replace("#BOARDTYPE", board_type) + else: + newline = line + fout.write(newline) + f.close() + fout.close() + + def write_accelerator_card_cfg(self, model): + """Write the configuration file passed to Vivado/Vitis (accelerator_card.cfg) + + Args: + model (ModelGraph): the hls4ml model. 
+ """ + # Write accelerator_card.cfg + filedir = os.path.dirname(os.path.abspath(__file__)) + f = open(os.path.join(filedir, "../templates/vitis_accelerator/accelerator_card.cfg")) + fout = open(f"{model.config.get_output_dir()}/accelerator_card.cfg", "w") + + memory_type = self.vitis_accelerator_config.get_memory_type() + num_kernels = self.vitis_accelerator_config.get_num_kernel() + num_channels = self.vitis_accelerator_config.get_memory_channel_count() + if memory_type == "hbm": + if num_kernels > 4: + print( + "WARNING: You are trying to instantiate too many kernels on the FPGA. " + "Synthesis is likely to fail due to resource shortage" + ) + num_channels_per_cu = num_channels // (num_kernels * 2) + elif memory_type == "ddr": + if num_kernels > self.vitis_accelerator_config.get_memory_channel_count(): + raise Exception( + format(self.vitis_accelerator_config.get_platform()) + + " has only " + + format(num_channels) + + " memory banks." + ) + + directives = self.vitis_accelerator_config.get_vivado_directives() + + for line in f.readlines(): + if "MYPLATFORM" in line: + newline = line.replace("MYPLATFORM", format(self.vitis_accelerator_config.get_platform())) + elif "# hls-fpga-machine-learning clock control" in line: + freq = round(1e9 / model.config.get_config_value("ClockPeriod")) + newline = f"clock={freq}:kernel_wrapper\n" + elif "# hls-fpga-machine-learning kernel control" in line: + newline = "[connectivity]\n" + newline += "nk=kernel_wrapper:" + format(num_kernels) + "\n\n" + if self.vitis_accelerator_config.get_board_type() == "alveo": + if memory_type == "hbm": + for i in range(0, num_kernels): + newline += "sp=kernel_wrapper_{}.in:HBM[{}:{}]\n".format( + i + 1, + (i * 2) * num_channels_per_cu, + ((i * 2 + 1) * num_channels_per_cu) - 1, + ) + newline += "sp=kernel_wrapper_{}.out:HBM[{}:{}]\n".format( + i + 1, + (i * 2 + 1) * num_channels_per_cu, + ((i + 1) * 2) * num_channels_per_cu - 1, + ) + elif memory_type == "ddr": + for i in range(0, num_kernels): 
+ newline += f"sp=kernel_wrapper_{i + 1}.in:DDR[{i}]\n" + newline += f"sp=kernel_wrapper_{i + 1}.out:DDR[{i}]\n" + newline += "\n" + for i in range(0, num_kernels): + newline += f"slr=kernel_wrapper_{i + 1}:SLR{i}\n" + elif "# hls-fpga-machine-learning vivado directives" in line: + newline = "" + if directives: + newline += "[vivado]\n" + for x in directives: + newline += x + "\n" + else: + newline = line + fout.write(newline) + f.close() + fout.close() + + # Write hls_config.tcl + tcl_f = open(os.path.join(filedir, "../templates/vitis_accelerator/hls_config.tcl")) + tcl_fout = open(f"{model.config.get_output_dir()}/hls_config.tcl", "w") + for line in tcl_f.readlines(): + newline = line + tcl_fout.write(newline) + tcl_fout.write("\nset_clock_uncertainty {}\n".format(model.config.get_config_value("ClockUncertainty", "12.5%"))) + tcl_f.close() + tcl_fout.close() + + def write_nnet_utils_overrides(self, model): + """Override nnet_types.h pointer comparison + + Args: + model (ModelGraph): the hls4ml model. + """ + + filedir = os.path.dirname(os.path.abspath(__file__)) + srcpath = os.path.join(filedir, "../templates/vitis_accelerator/nnet_utils/") + dstpath = f"{model.config.get_output_dir()}/firmware/nnet_utils/" + copy(srcpath + "nnet_types.h", dstpath + "nnet_types.h") + + def write_hls(self, model): + """ + Write the HLS project. Calls the steps from VivadoWriter, adapted for Vitis + """ + super().write_hls(model) + print("\n\nWriting Accelerator code") + self.create_accelerator_config(model) + self.write_nnet_utils_overrides(model) + self.write_build_script_backend_override(model) + self.write_parameters_overrides(model) + self.write_kernel(model) + self.write_host(model) + self.write_makefile(model) + self.write_accelerator_card_cfg(model) + print("Done")