We are the makers of low.js, a Node.JS clone for microcontrollers.
low.js runs on the ESP32-WROVER module. The ESP32 modules are probably the cheapest option (unit cost < $3 in larger quantities) if you need a microcontroller with a Wifi connection. You can even add an Ethernet jack (additional unit cost < $2), as we demonstrate with our neonious one board. The ESP32 features a 240 MHz dual-core processor, yet is energy efficient enough for battery-powered applications.
There is no need to think about things like deadlocks and memory management, and if there is an error, an exception provides meaningful information. In addition, the Node.JS API makes it a matter of minutes to interface with the Internet or have the board serve a website, allowing the user to change settings in a nice GUI.
Simply compile Node.JS for the ESP32 architecture?
Of course, that is what we tried first. However, the footprint of Node.JS is just too big:
We named this new port low.js.
low.js is not 100% finished yet, but the core work is pretty much complete, so we do not expect the footprint to grow much more.
By the way, low.js is not only smaller, it also boots faster than Node.JS, which is great, because microcontroller programs should start quickly:
Same features, far smaller footprint! Where is the catch?
As you may have guessed from the title of the article, the catch is the execution speed of computing-intensive programs (though low.js still does a good job compared to the alternatives):
The test program used here and throughout the rest of the article is:
var k = 0;
for(var i = 0; i < 1000; i++)
    for(var j = 0; j < 1000; j++)
        k += j;
console.log('Done', k);
low.js is faster, because the DukTape engine compiles the source code to bytecode. Bytecode is, just like the machine code which microcontrollers execute, a compact and flat representation of the program. With DukTape bytecode, every instruction is exactly 4 bytes long: 1 byte is the opcode, which tells DukTape what the instruction does; the other 3 bytes are the parameters.
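This layout can be sketched in C like so. The macro names here are illustrative, not DukTape's actual ones; the point is only how a 4-byte instruction splits into opcode and parameter bytes:

```c
#include <stdint.h>

/* One bytecode instruction is a single 32-bit word. In this
 * illustrative layout (names are ours, not DukTape's), the low byte
 * is the opcode and the remaining three bytes hold the parameters. */
#define DEC_OP(ins) ((ins) & 0xffu)
#define DEC_A(ins)  (((ins) >> 8)  & 0xffu)
#define DEC_B(ins)  (((ins) >> 16) & 0xffu)
#define DEC_C(ins)  (((ins) >> 24) & 0xffu)

/* Pack an opcode and three parameter bytes into one instruction word. */
#define ENC(op, a, b, c) \
    ((uint32_t)(op) | ((uint32_t)(a) << 8) | \
     ((uint32_t)(b) << 16) | ((uint32_t)(c) << 24))
```

Decoding is then a couple of shifts and masks per instruction, which is exactly why an interpreter can fetch and dispatch bytecode quickly.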
But still, running bytecode means that the bytecode interpreter is the actual program running on the microcontroller. It has to fetch bytecode and interpret it (blue in the chart above) before doing the actual work (purple in the chart above). These additional steps also prevent the bytecode interpreter from caching often-used variables in registers (ultra-fast memory locations inside the processor itself), as the registers are needed to fetch and interpret the bytecode.
Node.JS is blazing fast, as the V8 engine compiles the source code to machine code. While the DukTape bytecode is being read by a program which is being executed by the microcontroller, the compiled machine code is read directly by the microcontroller. As the source code is compiled to machine code just before execution, this is called just-in-time compilation.
Mission: Just-in-time compilation benefits at a small cost
With low.js, we will never reach the execution speed of Node.JS/V8, as V8 is over-optimized for execution speed benchmarks at a price which cannot be paid on microcontrollers.
Not only does Node.JS have a high footprint in RAM and disk space, it also takes a while for the JIT compiler to compile the program, making the launch of Node.JS noticeably laggy. We, however, want the microcontroller to boot fast.
But we still want to optimize execution speed with just-in-time compilation as far as we can without compromising much.
Challenge: Limited memory where the machine code can be run from
For just-in-time compilation we need to compile parts of the DukTape bytecode or the source code to machine code, place it in RAM from which we can execute it, and then execute it instead of interpreting the corresponding bytecode.
On the ESP32, the RAM from which machine code can be executed is called instruction RAM (IRAM). It can only be accessed with 32-bit aligned accesses. According to the ESP32 Technical Reference Manual, there are some 100+ KB of IRAM available.
Quickly we noticed that we could only allocate a few blocks of this type of RAM, amounting to 20-30 KB. Too much of ESP-IDF (the framework used to program the ESP32) is permanently loaded into IRAM for performance reasons. Also, to keep Wifi working we had to leave some memory free, as the Wifi driver seems to allocate IRAM, too. So we only have 10-20 KB of instruction RAM to work with.
Thus, compiling whole modules to machine code will not work. We can only compile parts of the code at a time, every once in a while freeing machine code which is no longer used to make room for other code.
The important design decisions
As we can only compile parts of the code at once, we decided to keep it simple:
(1) We will compile machine code from the DukTape bytecode, which will stay in memory (we have 4 MB for this, more than enough for typical microcontroller apps). This can be done faster than compiling machine code from source code directly. This way, if we have to free machine code for new JIT compilations, we still have the bytecode of this machine code to quickly recreate the machine code if it is needed again.
(2) We will also only compile bytecode which does not branch. This way we can simply start compiling when the bytecode interpreter hits a non-branch bytecode instruction which is not compiled yet. The JIT compiler then compiles instruction after instruction until it hits a branch instruction (the last instruction of a function is always a branch instruction).
(3) What first seemed to be a difficult task was deciding which machine code to free whenever memory is needed for a new JIT compilation. Any container structure which keeps track of the machine code blocks and figures out which blocks are used less often would cost more CPU cycles than the JIT compilation could save. Because the bytecode stays in memory (see (1)), recreating freed machine code is cheap, so no elaborate bookkeeping is needed.
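Decision (2) keeps the compiler's entry point simple: starting at the instruction which triggered the compilation, it walks forward until the first branch. A minimal sketch of that scan, with a hypothetical opcode set and `is_branch` predicate standing in for DukTape's real opcode table:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcode classification; real DukTape has many opcodes. */
enum { OP_LDCONST = 1, OP_ADD = 2, OP_JUMP = 3, OP_RETURN = 4 };

int is_branch(uint32_t ins) {
    uint32_t op = ins & 0xffu;          /* opcode lives in the low byte */
    return op == OP_JUMP || op == OP_RETURN;
}

/* Count how many straight-line instructions, starting at pc, can be
 * handed to the JIT compiler before the first branch ends the run. */
size_t straightline_run(const uint32_t *code, size_t len, size_t pc) {
    size_t n = 0;
    while (pc + n < len && !is_branch(code[pc + n]))
        n++;
    return n;
}
```

Because every function ends in a branch instruction, this scan is guaranteed to terminate even without the explicit length check.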
The just-in-time compilation itself
This section is quite technical, but I will do my best to explain it in clear language:
The bytecode interpreter of DukTape is implemented in the C source file duk_js_executor.c. It is one big function with a loop, which fetches the next bytecode instruction and then switches to different handlers, depending on the opcode.
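In simplified form, the structure looks like this. This is a toy model of such a dispatch loop with two made-up opcodes, not the actual code of duk_js_executor.c:

```c
#include <stdint.h>

/* Two illustrative opcodes; DukTape's real interpreter has dozens. */
enum { OP_ADDI = 0, OP_HALT = 1 };

/* Toy model of a bytecode dispatch loop: fetch a 32-bit instruction,
 * decode the opcode from the low byte, branch to a handler, repeat. */
int32_t run(const uint32_t *code) {
    int32_t acc = 0;
    for (;;) {
        uint32_t ins = *code++;          /* fetch */
        switch (ins & 0xffu) {           /* decode opcode */
        case OP_ADDI:                    /* acc += 8-bit parameter */
            acc += (int32_t)((ins >> 8) & 0xffu);
            break;
        case OP_HALT:
            return acc;
        }
    }
}
```

The fetch and the switch are the per-instruction overhead the JIT compiler aims to remove.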
For the first version of the JIT compiler we decided not to optimize the handlers, but rather simply let the machine code call them. This still helps performance, as the microcontroller no longer has to fetch bytecode and interpret it by branching to a handler; it is instead told directly to call the handlers.
For this, as a first step, we duplicated the bytecode interpreter code and moved each of the handlers into an individual function. Every function has a parameter list which includes thr and consts, variables used in every handler, plus the parameters of the bytecode instruction (with ins being the whole instruction for handlers which do special things):
void func1(duk_hthread *thr, duk_tval *consts, int a, int b, int c)
void func2(duk_hthread *thr, duk_tval *consts, int a, int b, int c, int ins)
void func3(duk_hthread *thr, duk_tval *consts, int a, int bc)
void func4(duk_hthread *thr, duk_tval *consts, int bc)
void func5(duk_hthread *thr, duk_tval *consts, int b, int c)
void func6(duk_hthread *thr, duk_tval *consts, int ins)
void func7(duk_hthread *thr, duk_tval *consts)
void func8(duk_hthread *thr, duk_tval *consts, int a, int b, int ins)
Now, all the JIT compiler has to do for each bytecode instruction it compiles is emit machine code which moves the parameters of the bytecode instruction into registers and calls the correct handler.
After the machine code is built, the bytecode instruction which triggered the JIT compilation is replaced with a new instruction with the opcode “run this machine code”. The parameter of the bytecode instruction is the address of the machine code (compressed to 24 bit, which is easily done by omitting the highest byte of the address, which is 0x40 anyways for instruction RAM on ESP32).
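Since the top byte of every instruction RAM address on the ESP32 is 0x40, the 24-bit compression is just dropping that byte and restoring it later. A sketch:

```c
#include <stdint.h>

/* ESP32 instruction RAM addresses all start with byte 0x40, so a
 * machine code address fits into the 24-bit parameter field of a
 * bytecode instruction once that constant byte is dropped. */
uint32_t compress_addr(uint32_t addr) {
    return addr & 0x00ffffffu;           /* drop the constant 0x40 byte */
}

uint32_t decompress_addr(uint32_t param) {
    return param | 0x40000000u;          /* restore it */
}
```

For example, the IRAM address 0x40081234 compresses to 0x081234 and decompresses back without loss.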
Now the bytecode interpreter runs several instructions at once by calling the machine code whenever it hits that bytecode instruction again. The machine code itself returns the bytecode instruction to which the interpreter shall skip after it has run.
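The contract between interpreter and compiled block can be pictured as a function pointer. The types and names here are illustrative, not the actual low.js interface:

```c
#include <stdint.h>

/* Illustrative contract: a compiled block does the work of several
 * bytecode instructions and returns the index of the bytecode
 * instruction at which the interpreter shall continue. */
typedef uint32_t (*jit_block_fn)(void *thr, void *consts);

/* Toy stand-in for a block covering instructions 10..14: after it
 * runs, the interpreter continues at instruction 15. A real block
 * would call the handler functions shown above. */
uint32_t demo_block(void *thr, void *consts) {
    (void)thr;
    (void)consts;
    return 15;
}

/* The interpreter side: run the block, take the next pc from it. */
uint32_t dispatch(jit_block_fn block, void *thr, void *consts) {
    return block(thr, consts);
}
```

Returning the next program counter from the block itself means the interpreter needs no bookkeeping about how many instructions each block covers.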
And of course, when the machine code is freed, the bytecode instruction is replaced with the original bytecode instruction.
Benchmark and where to continue
This first version of a JIT compiler for low.js is an immediate success. The example program runs a bit faster:
With this version implemented we see several starting points for the next optimizations:
- After profiling which bytecode instructions run often, we can let the JIT compiler implement the heavily used ones directly in machine code.
- When this is done, it becomes possible to remove slow memory accesses by caching variables in registers: because handler functions are no longer called, the JIT compiler controls the registers whenever these bytecode instructions run.
We believe that with this we can quite easily optimize low.js to reach 33% of Node.JS execution speed, which is more than enough for almost all use cases.
Do you have any ideas for improvements? We are looking forward to your comments!
Thank you for reading! If you like what we are doing, please take a look at the following things:
- We are looking for somebody to help us sell our products to companies world-wide! We are a small startup, but we have potential! Take a look at our job advertisement!
- We launched a hacking contest! Earn up to 500 USD in cash or for charity just by building cool things with low.js!
- Please take a look at www.neonious.com for a great microcontroller board with low.js, Ethernet and Wifi. The on-board integrated IDE + debugger allows you to rapidly try out stuff and have lots of fun.