-
Notifications
You must be signed in to change notification settings - Fork 36
How to compile Stefan's BASIC (and how it works)
Originally the program came in one source file basic.c. There were no headers and all Arduino code was integrated in this one source file. After the code has grown to 8000 lines and the source file was 160 kB and growing, I decided to change this. Version 1.2.1 is the last released version with just one file.
The 1.3 beta I am currently preparing and that is now in the repo comes as basic.c
, basic.h
and hardware-arduino.h
or hardware-posix.h
. These files can be compiled with gcc or the Arduino IDE. No makefile is supplied.
The monolithic 1.2 basic.c
can be compiled with MINGW gcc just like on Mac/Linux. (https://sourceforge.net/projects/mingw/). Define the MINGW macro in the code (see next section). 1.3b not yet tested.
Slightly exotic port. Use the dosify script on a Mac on basic.c
to generate a dos basic.c
source code. This script add CR and removes oneline comment. This file can be compiled with Turbo C version 2.01. It cannot be loaded in the editor because it is too big. The command line compiler tcc works well. Again, only 1.2.1 tested, 1.3b not tested or supported.
You use the file TinybasicArduino/TinybasicArduino.ino
and import it to the Arduino IDE. The file is strictly identical to basic.c. In addition to this there are three more files. basic.h
is the standard header file. It is identical to the basic.h
of the POSIX OSes. All Arduino hardware and device specific code has been moved to hardware.h
. In addition to it, there is wifisettings.h
. This is under development.
The language feature section of the code in basic.c
or TinybasicArduino/TinybasicArduino.ino
looks like this
#define HASAPPLE1
#define HASARDUINOIO
#define HASFILEIO
#define HASTONE
#define HASPULS
#define HASSTEFANSEXT
#define HASERRORMSG
#define HASVT52
#define HASFLOAT
#undef HASGRAPH
#define HASDARTMOUTH
#define HASDARKARTS
#define HASIOT
The core language set is Palo Alto Tinybasic (although I never used any code from this BASIC, only the language defintion was used). All other features of the language have to be activated by the macros in this part of the code.
- HASAPPLE1: strings, arrays, all commands related with it, logical expressions, peek and poke
- HASARDUINOIO: writing und reading from analog and digital pins, millisecond time and delay function
- HASFILEIO: commands to open and close files, reading and writing to files. This only makes sense if the target system has a filesystem. Otherwise the commands are all no operation but exist.
- HASTONE: the Arduino tone function
- HASPULSE: the Arduino pulse function
- HASSTEFANSEXT: extensions I added to the language set and that I found important, some functions like SQR and POW (exponentation), GET and PUT for single character output, memory dump, and extension to FOR NEXT, CALL and USR
- HASERRORMSG: error messages
- HASV52: VT52 escape sequence handling for displays, only makes sense if the system has a display connected
- HASFLOAT: floating point support including the functions SIN, COS, TAN, ATAN, LOG, and EXP, please read more about number types further down
- HASGRAPH: graphics functions for TFT or VGA displays
- HASDARTMOUTH: ON GOTO/GOSUB, DEF FN, DATA/READ/RESTORE commands present in the early BASIC standards
- HASDARKARTS: low level interpreter functions accessible to the user with many side effects, MALLOC and FIND to allocate memory on the heap, EVAL to convert strings to command lines at runtime and execute them
- HASIOT: under development, more to come here.
The language features are fairly independent and can be activated in any combinations. There may be combinations which are more useful than others.
After the language feature setting you will see a macro MEMSIZE. It is normally set to 0. A heuristic in the code will determine the right memory size at runtime using malloc() to allocate the BASIC memory. This will fail sometime and stability problems may arise. If this happens, the memory size can be hardwired here and the heuristic is deactivated.
For setting the hardware features open hardware.h
.
#undef USESPICOSERIAL
#undef ARDUINOPS2
#undef ARDUINOPRT
#define DISPLAYCANSCROLL
#undef ARDUINOLCDI2C
#undef LCDSHIELD
#undef ARDUINOTFT
#undef ARDUINOVGA
#define ARDUINOEEPROM
#undef ARDUINOEFS
#undef ARDUINOSD
#undef ESPSPIFFS
#undef ARDUINORTC
#undef ARDUINOWIRE
#undef ARDUINORF24
#undef ARDUINOMQTT
#undef STANDALONE
The BASIC interpreter code has quite a few device drivers and heuristics for typical hardware configurations build in. The macros in this section of the code steer these heuristics. In standard hardware configurations they should work without further chances of the code.
- USEPICOSERIAL: use the picoserial library for serial communications on Arduino 168 and 328 based platforms. Save memory and flash. Good for small systems. Doesn't work on MEGA and doesn't make sense on 32 bit platforms.
- ARDUINOPS2: use a PS2 keyboard, there are standard pins defined for this which can be changed in the code
- ARDUINOPRT: activate a second serial port for printing or to connect other devices. Software Serial is used on Arduino UNO and Nano with some standard pins, DUE and MEGA use the existing second UART channel
- DISPLAYCANSCROLL: this setting is relevant in combination with the display definitions. In essence, it creates a screen buffer and handles refresh of displays.
- ARDUINOLCDI2C: connect a display using the I2C port, supports 4x20 displays, can be adapted 2x16 or other dimensions.
- LCDSHIELD: support for the standard LCD shield, buttons and display functions are available in BASIC
- ARDUINOTFT: support of a SD1963 TFT display and CTE TFT shield for Arduino DUE and MEGA 256 including optionally the SD cards in the shield or display
- ARDUINOVGA: ESP32 VGA display with the TTGO VGA card version 1.4
- ARDUINOEEPROM: EEPROM support including external EEPROMS in real time clocks, can remain activated even if there is no EEPROM
- ARDUINOEFS: activate external EEPROMS via I2C as a filesystem. Independent of ARDUINOEEPROM.
- ARDUINOSD: SD card filesystem support for Arduino MEGA 265, DUE, ESP8266 and ESP32. Low memory boards like the UNO and NANO are not supported.
- ESPSPIFFS: SPIFFS filesystem support for the build in filesystem of an ESP8266 or ESP32. Cannot be used together with ARDUINOSD. You can have one or the other.
- ARDUINORTC: connect a DS3231 clock via I2C. Clock EEPROM access is supported. The EEPROM of the clock added to the internal EEPROM.
- ARDUINOWIRE: use I2C directly via the BASIC I/O logic (OPEN, CLOSE, ....) - only master code implemented
- ARDUINORF24: activate RF2401 wireless support - send and receive messages implemented
- ARDUINOMQTT: MQTT support and Wifi - under development.
- STANDALONE: if a keyboard and a display is available STANDALONE makes these devices standard I/O channels
- Picoserial for very small systems: https://github.com/slviajero/PicoSerial
- PS2 Keyboards: https://github.com/slviajero/PS2Keyboard (patched PS2 keyboard library)
- RTC clock: https://github.com/Naguissa/uRTCLib
- EEPROM Filesystem: https://github.com/slviajero/EepromFS
The BASIC interpreter can be used with different internal number types and lengths of number variables. They can be changes if there are special needs. You don't need to change anything here but you can. Only change the highlighted lines. The exact logic is controlled in this code section
#ifdef HASFLOAT
typedef float number_t;
const number_t maxnum=16777216;
#else
typedef int number_t;
const number_t maxnum=(number_t)~((number_t)1<<(sizeof(number_t)*8-1));
#endif
typedef unsigned short address_t;
const int numsize=sizeof(number_t);
const int addrsize=sizeof(address_t);
const int eheadersize=sizeof(address_t)+1;
**const int strindexsize=2; **
const address_t maxaddr=(address_t)(~0);
The core number type of the interpreter is number_t
. This is what you see in the interpreter when you store numbers or calculate. All variables, arrays and numbers are number type. By default this is either float or int depending on the HASFLOAT macro. You can use any other type here like double or long if the type is signed and at least as long as address_t
.
The type address_t
is not seen directly in the interpreter. It is used for addressing the storage. The type is unsigned and has to be able to enumerate the entire memory. It can be smaller but never bigger in number of bytes than number_t.
With the default settings in the code and the compilers the following will happen
- Arduino 8 bit systems: 16 bit signed integers or 32 bit floats as number type and 16 bit unsigned integers as for addressing storage. This is more then enough as the 8 bit systems have no more than 8kb RAM and 4 kb EEPROM.
- 32 bit systems: 32 bit signed integers or 32 bit floats as number type and 16 bit unsigned integers for addressing storage limiting the address space to 64 kB.
If you want to address more storage you can change address_t to a 32 bit type (i.e. unsigned int) on 32 bit systems as long as you keep number_t at least 32 bit. If you need higher accuracy on 8 bit or 32 bit systems you can use the longer types as number type. If you change a floating point number_t to a higher value you could also set maxnum to a higher value. maxnum is the biggest integer that can still be represented accurately in the floating point type.
A third parameter that can be customized here (although I don't recommend it) is strindexsize
. This variable controls the number of bytes reserved in a string to store the length. The default setting is 2 bytes meaning that strings can be at most 65535 characters long. Any other value would also be possible as long as address_t
can address the entire string.
All this is for 16 bit code. Using a different number_t changes the memory consumption of some of the structures.
Stack including stack pointer: 32 bytes
Input buffer and pointer to it: 94 bytes
Keyword buffer for progmem access: 32 bytes
Variable storage: 52 bytes
Program storage control: 10 bytes
FOR loop stack: 34 bytes
GOSUB stack: 10 bytes
Arithmetic variables: 8 bytes
Token and erros: 4 bytes
Intepreter state: 1 byte
Random number generator: 2 bytes
String/Array heap control: 2 bytes.
Input/Output flags: 4 byte.
plus some helpers.
This sums up to 300 bytes of interpreter memory with the default settings of maximum GOSUB and FOR stack of 4 levels, arithmetic stack depth of 15 and an input buffer of 92 characters.
The serial I/O code with need another 184 bytes of memory as it has its own input and output buffers.
If serial communication is used the naked interpreter needs 418 bytes plus the basic memory defined in MEMSIZE. There is enough space on an Arduino UNO for 1 kB of BASIC memory.
If picoserial is used from https://github.com/gitcnd/PicoSerial/ another 150 bytes of memory can be saved. The code activated by defining USEPICOSERIAL
uses existing buffers instead of creating new ones.
Please ignore the warning of the IDE on low memory for local variables. As most things are done with global variable the local variable stack is small. The biggest memory consumption on local variable comes from the expression function. It is recursive with a fairly high call depth as logical and arithmetic functions are evaluated by one parser. Calling from the top expression()
to the lowest level factor()
needs 8 calls i.e. return addresses.
Theoretical maximum memory is 64 kB if address_t is defined as a 16 bit quantity. Defining address type as a 32 bit quantity increases this into the MB range. This has never been tested.
The BASIC interpreter has been designed to run on really small systems and to be scalable to larger systems. The Arduino UNO still is the baseline. I want the language to be useful for these systems. Several factors control the size of program memory and the available RAM for BASIC.
The minimal configuration would be to #undef all language features and #define USEPICOSERIAL. This generates a plain Palo Alto BASIC which has no knowledge of any Arduino feature. This is about the user experience you got when you loaded your Palo Alto BASIC interpreter on an ALTAIR or IMSAI 8080 computer back in 1975 if you could afford the 2k memory extension (in 1975 memory was really expensive - see https://en.wikipedia.org/wiki/Altair_8800 for more). The Arduino 8 bit executable would be approximately 12 kB. It could run on an Arduino 168.
Activating ARDUNINOIO and ARDUINOEEPROM makes the system a useful BASIC programmed microcontroller. Analog and digital IO is possible now and programs can be saved on the EEPROM. This is way more a ALTAIR could do back then.
If you have a LCDSHIELD you could even try to set the LCDSHIELD code keeping DISPLAYCANSCROLL undefined. Currently the code compiles to a 15 kB executable. The LCD can be used know and button input is possible.
If you use BASIC on an UNO you could activate the HASAPPLE1. You get strings and arrays, two character variables and dynamic memory allocation for it. 1000 bytes of BASIC memory are safely possible on an UNO.
Defining HASSTEFANSEXT adds the several features like POW, SQR. HASTONE and HASPULSE switch on more Arduino IO functions. HASFILEIO is a minimal DOS like command set. All these features activated add up to 20 kb flash on an UNO.
The really memory hungry extensions are ARDUINOTFT with 18 k of flash memory for the UTFT library and ARDUINOSD with 11 kb of flash. These extensions are best used on the larger boards like the Arduino Mega 2560 which I used as a standalone computer.
Another memory hungry feature is floats. HASFLOATS activates the floating point support. It costs approximately 7 kB on an UNO. As a side remark, this shows how memory efficient the programmers in the 70s worked with their all assembler code. The first floating point BASIC interpreters used 8kB of memory for the entire program.
The program was written wit the the intention to use as little memory as possible, run on really small systems and be as portable as possible. Many things are done by hand without the use of standard libraries. Everything is in C. No C++ language features as used.
Global variables are used as much as possible and the C call stack is avoided to control the memory consumption. Arithmetic is happening on the BASIC stack. Even a full features system will run with a few hundred bytes of free C heap size.
All layer 2 functions will change the global variables. They cannot be used to store something while calling a function.
x, y are the number_t
accumulators and xc, yc are the 8 bit accumulators. z is a number_t
accumulator used for type conversions. It is defined as a structure. ir and ir2 are general string index pointer. top is the top of program memory and here is the current position of the interpreter in the code. st is the state of the interpreter. It can be either interactive, run or run from eeprom. There is a general input buffer ibuffer with an own index bi for processing input commands. A stack of integers stack with a stackpointer sp is used in arithemtic. It should only be accessed through push()
and pop()
. The buffer sbuffer is needed for Arduino progmem access and used for number input.
The key function is nexttoken()
. In interactive mode it reads ibuffer step by step using the pointer bi, does a lexical analysis and after isolation a token it returns. In run mode it calls gettoken()
and returns the next token from memory for program execution. If a token is found its code is returned in the global variable token. If the token has arguments they are returned in ir, xc, yc, x.
For a command keyword like PRINT or a single character token like '+' only token is defined after nexttoken()
. It contains the token code for keywords and the char value for single character tokens. All other registers are undefined after nexttoken().
If a string is found the token code is STRING, ir contains a pointer to the string with stripped '"' characters and x is the length of the string. In interactive mode ir points to the input buffer and in run mode it points directly to program memory.
If a number is found the token code is either LINENUMBER or NUMBER. The value of the number is stored in x.
In the function statement() tokens are identified and the worker functions are called. Look for xget() in the code for an example of a very simple worker function. The first thing it does is to call nexttoken()
and then decide what to do. If the expected token is found the workload code is executed and then nexttoken()
is called before leaving. Every worker function must do this because after returning to the statement()
loop a new token is needed to continue with the execution. If a function doesn't call nexttoken() before returning the interpreter may block.
Variables are put on a heap that growth from the to the memory downward. bmalloc
and bfind
do this job for you. Search of a variable is done by using xc and yc to identify the variable and the token value to identify the type. Each variable on the heap needs at least 3 bytes for this header. For strings and arrays the header is followed by the byte length of the object while variables store the value of the integer directly. bfind
scans the variable table using the size information. If you add commands, avoid using the heap functions directly. The set/get[variable, array, string] functions are a better ways of doing this. Use DUMP
to see what happens if you add variables.
Objects on the heap are not dynamic. They reserve their fixed maximum length on the heap. This is different from the classical Microsoft BASIC interpreters which implemented dynamic memory management for strings. For small systems and microcontrollers the static implementation has big advantages. No garbage collection is needed and the interpreter can be used for slow real time applications as it always as the same response time. No background garbage collection can slow down or interrupt a program.
Programs are always tokenized. This means that all keywords, the line number, constants and variable names are stored as tokens. Tokens are of type signed char. They are followed by a number of arguments. Examples for multibyte tokens in a 8 bit / 16 bit systems would be
Constants like 13456: NUMBER LOWBYTE HIGHBYTE
Line numbers: LINENUMBER LOWBYTE HIGHBYTE
String: STRING LENGTH CHAR1 .. CHARn
Variable: VARIABLE CHAR1 CHAR2
Array: ARRAYVAR CHAR1 CHAR2 ( expression )
String variable: STRINGVAR CHAR1 CHAR2
The BASIC interpreter reads token from the program storage and ignores line numbers. They are treated as terminal symbols, indicating the end of a statement but not as part of a program cursor. The only cursor is here which is the address of the next token to be loaded. here is address_t and in general an unsigned integer. This means that the interpreter doesn't know in which line it is. This is very different from the typical tinybasic implementations for Arduino and elsewhere. These interpreters use line number and text position as a cursor. If the interpreter needs to know in which line it is, it has to use findline() to scan through the program storage. As the program storage is not a chained list but a simple heap, findline() is a compute time expensive function. For this reason loops should be implemented with FOR instead of GOTOs.
On single core ESP based systems the main code has to yield the CPU periodically to allow the system to do necessary background task. An interpreter language like IoT BASIC spends almost all the CPU time in loops. Either it waits for input and output or it processes tokens from the program storage. Not yielding in these loops will crash an ESP8266 fast.
Yielding and background task handling is encapsulated in the byield()
function. It is called in the statement loop after every statement and in all input and output loops. The statement loop is runs approximately at timescale between 10 and 100 microseconds depending on the commands and the micro controller. Yielding the CPU in these intervals is more than sufficient.
Byield also triggers the network helper functions. Currently this is only the MQTT and Ethernet client function. More is to come here. The byield()
function remembers the last time it has called the helper function in lastyield
. Calling the network helpers too often will crash the system as the function causes background tasks which need to be completed before it can be called again safely. For this reason the network helper is called every 32 milliseconds followed by a 2 ms delay / yield cycle to give the system time to sort things out in the background. The parameters YIELDINTERVAL
and YIELDTIME
control this behaviour.
The BASIC DELAY
function can also not be handled directly by the Arduino IDE delay() function. As the network helper has to be called periodically, delays are broken down in 1 ms intervals and byield() is triggered every YIELDINTERVAL
milliseconds.
A second parameters controls the long term activities. LONGYIELDINTERVAL
is set to 1000 per default. Everything that needs to be done once per second but not more often is triggered in this part of byield(). Renewing DHCP leases for ethernet systems is one example.
Yielding is particularly critical during I/O operations. Input loops yield after every attempt to read a character. At output the outs() function yields after every line. No yielding is done on a character output level, as the GET and PUT statement will do the yielding anyway. Theoretically, if very long strings (>1kB) are written as one single PRINT statement to a Serial buffer there is still a risk to crash an ESP8266 with the current yield strategy.
In essence, there are 3 relevant timescales in the code
- 10 microsconds: the timescale of the statement loop and the ESP8266 yielding -> call to byield()
- 30 milliseconds: the timescale of low level networking (this is compatible with the 100ms beacon interval of Wifi) -> call to yieldfunction()
- 1 second: slow housekeeping tasks -> call to longyieldfunction()
Let's assume you would like to add a new command or function MYCOMMAND to the interpreter.
To add your own command you would first need to create a token. Review the definition section of basic.h
. Tokens are numbered from -127 for NUMBER on upward up to -24 for SLEEP. This may change in the future as I will add a few more commands later on. To add another command MYCOMMAND after SLEEP, first add a token definition line.
#define TMYCOMMAND -23
Token values have to be consecutive. It has to be -23 in our example. I used the convention that keyword tokens begin with a T.
You then will have to add the keyword string as
const char smycommand[] PROGMEM = "MYCOMMAND";
If you add it outside any #ifdef ... #endif
definition, it will always be there. Otherwise you can add it to any of the language sets and it will only be there is the language set is activatd.
The keyword has to added at the end of
const char* const keyword[] PROGMEM = {..., smycommand, 0}
Again, order matters here. Mind the #ifdef ... #endif
language set sections. Either consistently in one or outside of one. The keyword for -23 has to come after the keyword for -24.
The token then needs to added to the token array, again in the right order.
const signed char tokens[] PROGMEM = {..., TMYCOMMAND, 0}
Again, mind the #ifdefs.
After these four steps the lexical analyser will recognise the token and convert the string MYCOMMAND to the token -23 whenever it is found in the input stream. This is so complicated because the code is made for Arduino PROGMEM. This is a bit weird as Arduinos are Harvard machines. The address space of the flash memory is different from the RAM address space.
The lexer has a little peculiarity. It matches strings and once it has found a match it tokenises. This means that command strings have to be unique from the start. You could not add the commands MYCOMMAND and MYCOMMANDA in the interpreter as the latter would be tokenised to MYCOMMAND A.
For a new command, create a worker function and then add the worker function to the statement loop. Worker functions are defined in basic.c
and prototyped in basic.h
.
Your worker function that takes no argument and simply prints "Hello World" would look like this:
void xmycommand() { outs("Hello World\n"); nexttoken(); }
It would appear in the statement() function like this:
case TMYCOMMAND: xmycommand(); break;
Worker functions are always void and get no argument. They take all their information from the interpreter state variables. Look at xset()
or xget()
to see how this works.
Worker functions always need to call nexttoken()
before returning to keep the interpreter going. A fresh token needs to be stored in token for the statement loop not to get trapped in an infinite loop.
There is one pitfall. The function expression()
and functions using it like parsearguments()
already have nexttoken()
called at the end. If you use expression()
in a worker function you should not call nexttoken()
at the end of the worker function.
A command that takes would print the value of a variable would look like this:
void xmycommand() {
nexttoken();
if (token != VARIABLE) {
error(EUNKNOWN);
return;
}
outnumber(x);
nexttoken();
if (!termsymbol()) {
error(EUNKNOWN);
return;
}
nexttoken();
}
The first nexttoken() loads the next token after the command from the token stream. If it is not a variable, a SYNTAX ERROR is reported. If it is a variable, the value of the accumulator x contains the value which can be printed using the level 2 output function outnumber(). After that the next token is loaded and it is checked whether it is a terminal symbol ending a line or statement. If not an error is thrown. The last nexttoken() loads a new token and then returns to the statement loop for execution.
If you want to create a function to be evaluated in expression, you need to add it to factor().
This would look like this
case TMYCOMMAND: parsefunction(xmycommand, 1); break;
Unlike commands in statement, functions do not call nexttoken()
before returning, as they are part of the recursive decent parser and follow their own rules.
A function that would return two times the argument would look like:
void xmycommand(){ x=pop()*2; push(x); }
Functions used in factor always need to get all their arguments from the stack and then keep exactly one argument on the stack as a result.
A command with two numerical arguments would be connected like this:
case TMYCOMMAND: parsefunction(xmycommand, 2); break;
It would then have to do something like this with the stack for a function that calculates f(x,y)=2*y+x.
void xmycommand(){ x=pop()*2+pop(); push(x); }
Two pop(), one push(). A variable number of arguments is possible but a little bit more complicated to implement.
Debugging the interpreter after adding a command, error handling, segmentation faults, and other nasty things.
Using the command
set 0,1
switches on the build in debug mode of the statement loop. The token stream is logged to the console after each command.
If the interpreter hangs after a command, it may be that nexttoken()
is not called before returning. Commands functions always must make sure that a new token is created before returning. If commands are swallowed and not executed, nexttoken()
is called too often. Functions called in factor should not generate new tokens calling nexttoken()
. The recursive decent parser code takes care of this.
Errors have to reported by calling the error()
function. It takes a valid error number as an argument. EUNKNOWN is a good choice for a starter. This throws a syntax error. The global variable er
contains the error state. It can and has to be checked after calling functions that can generate errors. If an error is produced from deeper down, simply return. statement()
resets the error status after each iteration of the loop and then changes interpreter state to interactive.
Segmentation fault crashes usually mean that the keyword array is messed up. Keyword and token array have to match exactly. Token values need to be consecutive. A second source for segmentation faults is memory access and wrong string code.
Stack overrun and underrun errors usually mean that you have pop()ed or pushed() too many or too few arguments on the stack. All functions need to keep the stack clean. There is no stack reset mechanism except after error()
when going to interactive.
Crashes can also happen when there is not enough heap memory. Commands should use a minimal set of local variables and should not use the C stack for recursion. If you need more memory for the heap. Either change the offset parameters in the freememory mechanism or set MEMSIZE by hand.