Interest in Flopoco-based FPU, Instruction invalidation on self-modifying code or Verilator Improvements? #20

@ramonwirsch

In my fork, I have added some functionality to CVA5. I test in a significantly modified environment, so I cannot simply open pull requests right now.
Is there interest in somehow getting some or all of these changes upstream? And if so, which?

The main additions are:

  • a Flopoco-based FPU with customizable pipeline latencies across 4 CVA5 pipeline modules and an FP register file. The FPU supports all single-precision operations, but no subnormals, rounding modes, exceptions or FPU CSRs yet. Since Flopoco generates VHDL code, there is a Verilator DPI-C implementation for all of these operations that is a drop-in replacement for the actual Flopoco implementations. Regenerating matching Flopoco implementations from a user-supplied Flopoco binary, instead of using the pre-generated files in my own git submodule, would require additional work, but the Verilator implementation runs out of the box. FP latencies are not yet configurable from the CPU_CONFIG structure, but could easily be, as they are already parameterized inside the pipelines.
    • The FP RF is a separate instance of the existing RF (now more parameterizable), with 64 physical registers that are also handled by the renamer. They are 2 bits wider than the GP registers to match Flopoco's wider format. This also allows for cheaper 3r1w ports independent of the GP RF (although the FP-MAC implementation does not use the 3rd operand simultaneously, so enhanced decode and issue stages could make do with only 2 read ports). I have not investigated the synthesis impact of using a shared pool of physical registers to avoid allocating 2x 64 registers or to reduce the need for separate infrastructure in the renamer.
  • Optional (build-time and runtime, controlled via CSR) instruction invalidation for all data writes. The instruction cache and branch predictor were not kept coherent with data when it changed, so bootloader functionality was problematic. I have not kept up to date with what CVA5 can do in this regard out of the box (there seems to be an early-branch-flush feature to at least handle this in the predictor). This invalidation can slow down the processor a bit, as each write is signaled to a configurable number of FIFOs to check whether invalidation is needed. The invalidation is off by default and needs to be enabled via a custom CSR for as long as overwriting existing instructions is possible.
  • I have reworked the Verilator implementation with new command-line options/parsing and features. Among them:
    • Extensible: I can have my own build with a different top-level file with more ports by switching out one C file and reusing all the rest
    • can terminate on reaching infinite loops (optional) or a user-exit magic-nop
    • configurable stall limit (an RTOS with WFI instructions will hit the hard-coded limit very fast)
    • UART redirectable to file, including inputs. Can be used with socat to simulate bootloaders communicating over UART with the actual host-side loader tool
    • new combined format for memory contents. I have devised a text format that lists an arbitrary number of binary files, each with offsets and ranges from which actual memory contents and reference contents can be loaded. Verilator can initialize both local memory and DRAM from this format. It is essentially a dumbed-down, human-readable ELF header table. In most cases, my tool (written in Kotlin, complex, reads and understands ELF files) just generates this index file, while the actual memory contents are read from the original ELF binary. Additional contents can easily be mixed in or overlaid. Since the format is sparse and supports zero-initialization, it can save a lot of space compared to the existing hex files.
    • local memory and DRAM can be initialized separately, even from the existing hw_init hex formats
    • use FST format instead of VCD (but configurable at build-time). Much faster and more space-saving
    • out-of-tree: I now build Verilator with CMake, which builds faster and is more comfortable. This is also where the Flopoco source files are integrated right now
  • out-of-tree/WIP: Zephyr port, intended to build a multi-threaded application that manages many things, including bootloading via an additional UART port. It supports user mode and UART, but works around the lack of RISC-V PMP, so user mode is not actually isolated by any means.
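
To illustrate why the FP registers are 2 bits wider than the GP registers: FloPoCo's floating-point format, as far as its documentation describes it, prefixes sign, exponent and fraction with a 2-bit exception field (00 zero, 01 normal, 10 infinity, 11 NaN). The following is a minimal sketch under that assumption, not code from the fork. `ieee_to_flopoco` is a hypothetical helper showing how a DPI-C model might pack an IEEE-754 single into such a 34-bit word, flushing subnormals to zero in line with the "no subnormals" limitation above.

```c
#include <stdint.h>

/* Exception field values of the FloPoCo format (assumed from its documentation). */
enum { FLO_ZERO = 0, FLO_NORMAL = 1, FLO_INF = 2, FLO_NAN = 3 };

/* Hypothetical helper: pack IEEE-754 single bits into a 34-bit word laid out as
 * [33:32] exception, [31] sign, [30:23] exponent, [22:0] fraction. */
uint64_t ieee_to_flopoco(uint32_t bits) {
    uint64_t sign = bits >> 31;
    uint64_t exp  = (bits >> 23) & 0xFFu;
    uint64_t frac = bits & 0x7FFFFFu;
    uint64_t exn;

    if (exp == 0) {
        /* Zero and subnormals: both become a signed zero (flush to zero). */
        exn  = FLO_ZERO;
        frac = 0;
    } else if (exp == 0xFFu) {
        /* All-ones exponent: infinity if the fraction is zero, NaN otherwise. */
        exn = (frac != 0) ? FLO_NAN : FLO_INF;
    } else {
        exn = FLO_NORMAL;
    }
    return (exn << 32) | (sign << 31) | (exp << 23) | frac;
}
```

For example, `ieee_to_flopoco(0x3F800000)` (1.0f) yields the normal-number tag in the top two bits followed by the unchanged IEEE bits, while any subnormal input collapses to a zero with only the sign preserved.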
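
The "terminate on infinite loops or a user-exit magic-nop" option could be checked per retired instruction roughly as sketched below. This is an illustration only: the function names and the `EXIT_NOP` encoding (`addi x0, x0, 15`) are hypothetical and not taken from the fork. The self-loop test relies on the standard RV32I encoding, where a `jal` with a zero offset has bits 31:12 cleared and opcode `0x6F`. Compressed self-jumps and zero-offset branches are not covered here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical magic NOP: addi x0, x0, 15. The fork's actual encoding is not stated. */
#define EXIT_NOP 0x00F00013u

/* True for "jal rd, 0": a jump to its own address, i.e. a one-instruction infinite loop.
 * The mask ignores the rd field (bits 11:7). */
bool is_self_loop(uint32_t instr) {
    return (instr & 0xFFFFF07Fu) == 0x0000006Fu;
}

/* True for the assumed user-exit marker. */
bool is_exit_nop(uint32_t instr) {
    return instr == EXIT_NOP;
}

/* Decide whether the simulation should stop after retiring `instr`.
 * Loop detection is optional, mirroring the optional behaviour described above. */
bool should_terminate(uint32_t instr, bool stop_on_loop) {
    return is_exit_nop(instr) || (stop_on_loop && is_self_loop(instr));
}
```

A testbench main loop would call `should_terminate` with each retired instruction word and leave the simulation loop when it returns true.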
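
The combined memory format is described as a dumbed-down ELF header table, so one way to picture an index entry is as a subset of ELF program-header fields. The sketch below is purely illustrative: the line layout (`<path> <file_offset> <file_size> <mem_addr> <mem_size>`, numbers in hex) and all names are assumptions, since the actual syntax is not given here. It shows how sparse, zero-initialized ranges fall out naturally when `mem_size` exceeds `file_size`.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical index entry, modeled on ELF program-header fields. */
typedef struct {
    char     path[256];   /* binary file that holds the bytes */
    uint64_t file_offset; /* where the segment starts within that file */
    uint64_t file_size;   /* number of bytes to copy from the file */
    uint64_t mem_addr;    /* target address in simulated memory */
    uint64_t mem_size;    /* total size; bytes beyond file_size are zero-filled */
} mem_segment;

/* Parse one line of the assumed layout. Returns false on malformed input
 * or when the file part would not fit into the memory range. */
bool parse_segment(const char *line, mem_segment *out) {
    int n = sscanf(line,
                   "%255s %" SCNx64 " %" SCNx64 " %" SCNx64 " %" SCNx64,
                   out->path, &out->file_offset, &out->file_size,
                   &out->mem_addr, &out->mem_size);
    return n == 5 && out->mem_size >= out->file_size;
}
```

With a line such as `app.elf 0x1000 0x200 0x80000000 0x400`, a loader would copy 0x200 bytes from offset 0x1000 of `app.elf` to address 0x80000000 and zero the following 0x200 bytes, without those zeros ever being stored on disk.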
