Skip to content
Daejun Park edited this page Nov 27, 2017 · 22 revisions

Semantics of Viper in K

Viper Language

Viper is a new experimental language for the smart contracts running on EVM. It is designed by Ethereum foundation, supposedly being simpler and more secure language than Solidity by factoring out the essential components in writing smart contracts, based on their experience of deploying and operating hundreds of thousands of smart contracts over the last two years.

Specifically, Viper is a kind of object-oriented language in the sense that a smart contract is represented as a class whose data fields hold the contract states and methods provide the contract functionalities. Notably, however, it has no explicit inheritance --- one of the main design decision made based on their extensive experience of writing smart contracts. They found that security is more important than programming convenience, thus the minimality has obvious advantages.

Viper is supposedly a strongly statically typed language that supports type inference. Types consist of primitive types (num of 128-bits signed integers with overflow exception, num256 of 256-bits unsigned integers with overflow exception, decimal with 128-bits of integer numbers and 10-bits of fractional numbers, bool, address of 20-bytes, and bytes32), and compound types (fixed-size arrays, structs, and maps with primitive typed keys).

A local variable has the function scope (i.e., no variable hiding in local blocks) but no declaration is needed, although it is questionable from the security perspective of the language design. It does not allow recursive functions nor unbounded-loops, but only bounded-loops with break. It adopts Python syntax, e.g., being white space sensitive.

Viper Semantics Specification

Specifying Viper language semantics is challenging. As an experimental language, Viper does not have a specification document other than the source code of its compiler to EVM, which also keeps evolving. Therefore, we took a snapshot (a886850c1d) of the latest version of the compiler at the time of writing, and revealed the semantics via reverse-engineering the source code as well as communication with the language designers.

Thanks to KEVM, we were able to specify the Viper language semantics by simply formalizing the compiler translation from Viper to EVM. Combined with KEVM, the Viper-to-EVM translation semantics suffices to present the whole semantics of Viper. This "semantics by translation" approach also helps to emphasize the essence of Viper semantics by factoring it out from EVM semantics.

We formalized the translation from Viper to EVM. Our formalization adopted the two-staged translation approach taken by the Viper compiler, that is, Viper to LLL translation followed by LLL to EVM translation, where LLL is served an intermediate representation.

Having been originally designed for Solidity compiler, LLL is a Lisp-like language but much simpler and lower-level than Lisp and closer to EVM. An LLL expression is an S-expression that is a list of an operator symbol followed by zero or more arguments of another S-expressions. Part of the LLL operators correspond to the EVM opcodes with the same name. Only a fragment of LLL is used for the intermediate representation of Viper.

Viper to LLL translation

The Viper to LLL translation mainly performs the following tasks: lowering function calls using the ABI encoding; encoding logical data structures such as structs into byte-array representation (with 32-byte alignment); allocating memory and storage for each local and state variables, resp.; inlining builtin-functions; type checking; and inserting runtime checks for arithemtic overflow. The control-flow statements are simply translated to that of LLL.

LLL to EVM translation

The LLL to EVM translation is rather straightforward. An LLL expression with the EVM opcode is translated to a seqeunce of EVM opcodes that amounts to evaluating the arguments and pushing the results into the stack in the reverse order followed by the corresponding EVM opcode. A control-flow operator such as if and repeat is translated in the usual way using the jump opcodes.

Viper language design issue

We found that Viper language is tied too much to EVM, inevitably inheriting the limitations of EVM. We stongly believe that such a high-level language as Viper should have a standalone, abstract semantics that is separate from its instantiation (i.e., translation) to its low-level target language(s) such as EVM. This helps to understand and reason about the language semantics in the proper level, as well as make it easier to apply the language to different targets. Also, this separation of concerns will help to formally verify Viper programs. One can verify the high-level logic of the smart contract in the right level of abstraction, and then prove that the Viper program simulates the translated EVM code, i.e., the EVM code does not admit exceptional behaviors (including the EVM-specific low-level exceptions) until they are defined in the original Viper program.

We discuss a couple of specific design issues to be improved.

Address space of maps. A map of Viper is a total function with domain of the entire address space of the local memory, which may lead to conflict between different maps. Currently, Viper simply relies on the hashing scheme to avoid the conflict, hoping the hash collision will not happen, but in theory, it can be broken thus not ultimately reliable. Despite the fact that the probability of the hash collision is very low and if that happens the entire block chain will be broken, we do believe that the language should provide an extra protection layer to completely prevent the conflict even in the presence of the hash collision. For example, monitoring the hashed values can detect the hash collision and the value can be re-hashed with a differnt salt if that happens.

Currently, the internal calls propagate exceptions all the way up to the root callers, which we think is induced by EVM semantics. However, we think that Viper should provide a way of catching those exceptions so that one can have more control of handling those exceptional cases in a secure way.

Bugs found

While formalizing the lauguage semantics, we found several bugs of Viper compiler, including a critical security vulnerability that is being fixed by Viper team. Those bugs were reported, confirmed, and fixed.

We cannot disclose the security vulnerability until it is fixed. Below we discuss a couple of the other bugs found.

A bug was found in handling typecasting where they failed to insert a runtime check while it should. For example, as_num256 is typecasting to num256, 256-bits unsigned integers, but it is not supposed to typecast a negative value since it will be coerced to a quite different (large) positive number. Viper compiler rejects the invalid typecasting expressions when they can be detected at compiler time, but failed to add the run-time check when they are not. For example, as_num256(-1) is rejected at compile-time, but as_num256(x - y) was simply compiled to x - y even in the case of x < y.

Another bug was found in compiling the hash expression, sha3(e). For example, sha3(0) is supposed to be translated into EVM code, PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 32, PUSH1, 192, SHA3, which means that the code first stores the value to be hashed, '0', in the local memory at the pre-designated location (here 192), and then computes SHA3 hash of the memory range (with starting location 192 and width 32, i.e., MEM[192..223]). But the actually genereated code was: PUSH1, 0, PUSH1, 192, MSTORE, PUSH1, 192, PUSH1, 32, SHA3 (i.e., PUSH1 192 and PUSH1 32 were flipped), which means that it computes SHA3 hash of the wrong memory range (with starting location 32 and witdh 192, i.e., MEM[32..223]). This seemingly obvious bug had not been found as the following reason. Fortunately (or unfortunately) the unintended memory range MEM[32...191] that is prepended to the value to be hashed (i.e., MEM[192...223]) is fixed over all Viper programs. Indeed, they are pre-occupied to store five values of size limits used to clamp values: ADDRSIZE, MAXNUM, MINNUM, MAXDECIMAL, MINDECIMAL.

Clone this wiki locally