Battery powered temperature monitor

Metanate's client had developed a battery powered temperature logger to be installed in a domestic setting. The device was powered by two AA cells and collected temperature samples from two probes at 10 s intervals, periodically sending the gathered readings to the cloud backend over a TLS connection via WiFi.

To give a long battery lifetime (in excess of 2 years), a dual processor architecture was used (as shown in the diagram below), which enabled the Comms Processor (an ESP8266) to be powered down when not in use.

The Control Processor was programmed in C using the Atmel Software Framework (ASF), while the Comms Processor was programmed in C using the Espressif ESP8266 Non-OS SDK.

The Control Processor is responsible for:

Managing the initial device commissioning process
Taking periodic temperature readings and storing them in flash until they are transmitted to the cloud service
Running an algorithm on the readings, the output of which may be events to be notified immediately to the service
Periodically contacting the cloud service (using the Comms Processor) to upload data and event information
Monitoring the state of the batteries
Monitoring the state of the button
Displaying system status via the LED
Performing over-the-air upgrades of one or both of the processors when instructed by a message received when contacting the platform

The Comms Processor is powered up when required by the Control Processor and provides:

a WiFi Access Point for interaction with a commissioning App
a TLS connection to the cloud service as a WiFi Client

A Metanate engineer joined the client's development team with a brief to investigate and fix some issues affecting reliability of the device when deployed operationally.

The approach taken was to stabilise the development environment by importing known versions of the two manufacturer SDKs into the git source control system, reapplying the project changes and using this source as the base to move forward. A pass of static analysis weeded out unsafe C practices, and additional unit tests meant that errors could be caught in the TeamCity build pipeline.

The main issues discovered and resolved during the project were:

Memory allocation failures in the Comms Processor.

The Tensilica Xtensa processor used in the ESP8266 is a 32-bit processor with 16-bit instructions. It has a Harvard architecture; most significantly, this means that instruction memory and data memory are completely separate. The ESP8266 has only 80 kB of user RAM, which in a C environment is occupied by .data, .rodata, .bss, the stack and the heap.

The mbed TLS library dynamically allocates RAM using malloc when initiating a TLS connection and, depending on the path taken to derive session encryption keys, the heap could easily be exhausted.

A concerted effort to reduce stack-allocated data, together with moving constant data into flash through careful use of linker sections and accessor functions (flash access must be via 4-byte aligned reads), reclaimed enough RAM to allow the mbed TLS library to connect reliably.
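The accessor idea can be illustrated with a minimal sketch. It assumes a little-endian target and uses an ordinary constant array (flash_words) as a stand-in for data that the linker would place in flash; the names flash_read_u8 and flash_copy are hypothetical and are not taken from the project or the Espressif SDK. Every access is a whole, aligned 32-bit load, and the wanted byte is then extracted by shifting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for constant data held in flash. On a real device this array
 * would be placed in a flash section by the linker (assumption). */
static const uint32_t flash_words[] = {
    0x6c6c6548u, /* 'H' 'e' 'l' 'l' in little-endian byte order */
    0x0000006fu  /* 'o' '\0' '\0' '\0' */
};

/* Read one byte at byte offset `off` using only an aligned 32-bit load. */
static uint8_t flash_read_u8(const uint32_t *base, size_t off)
{
    uint32_t word = base[off / 4u];              /* aligned word holding the byte */
    unsigned shift = (unsigned)(off % 4u) * 8u;  /* little-endian byte lane */
    return (uint8_t)(word >> shift);
}

/* Copy `len` bytes starting at byte offset `off` from flash into RAM. */
static void flash_copy(void *dst, const uint32_t *base, size_t off, size_t len)
{
    uint8_t *out = dst;
    assert(dst != NULL && base != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i] = flash_read_u8(base, off + i);
    }
}
```

Code that previously dereferenced a constant string directly would instead copy it into a small RAM buffer with flash_copy just before use, so the constant no longer occupies RAM permanently.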

Subsequently, a move to release 3 of the Espressif SDK eased the situation further (since it allows allocating some of the device's instruction cache to the heap).

Unreliability of use of flash storage

The code in both processors relied upon unsafe flash write methods, which led to subtle bugs in the temperature data reporting and loss of device state. Recoding to use safe updating with a two-stage commit removed this source of error.
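One possible shape for a two-stage commit is sketched below. It is an illustration under stated assumptions, not the scheme used in the project: flash is simulated by a RAM array of two slots, the record layout, COMMIT_MAGIC and the checksum are invented for the example, and the fail_before_commit flag exists only to simulate power loss between the two stages. Stage 1 writes the new record into the slot that is not currently live; stage 2 writes a commit marker. A record without a valid marker and checksum is ignored, so an interrupted update leaves the previous record in force.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_LEN  8u
#define COMMIT_MAGIC 0xC0DEu   /* hypothetical marker value */

typedef struct {
    uint32_t seq;                   /* increases with every update */
    uint8_t  payload[PAYLOAD_LEN];
    uint16_t check;                 /* checksum over seq and payload */
    uint16_t commit;                /* COMMIT_MAGIC once stage 2 is done */
} record_t;

static record_t slots[2];           /* simulated flash: two record slots */

/* Simple Fletcher-style checksum, standing in for a real CRC. */
static uint16_t record_check(uint32_t seq, const uint8_t *payload)
{
    uint16_t a = 1u, b = 0u;
    for (unsigned i = 0; i < 4u; i++) {
        a = (uint16_t)((a + ((seq >> (8u * i)) & 0xFFu)) % 251u);
        b = (uint16_t)((b + a) % 251u);
    }
    for (unsigned i = 0; i < PAYLOAD_LEN; i++) {
        a = (uint16_t)((a + payload[i]) % 251u);
        b = (uint16_t)((b + a) % 251u);
    }
    return (uint16_t)((b << 8) | a);
}

static bool slot_valid(const record_t *r)
{
    return r->commit == COMMIT_MAGIC &&
           r->check == record_check(r->seq, r->payload);
}

/* Index of the newest committed slot, or -1 if none is valid. */
static int store_latest(void)
{
    int best = -1;
    for (int i = 0; i < 2; i++) {
        if (slot_valid(&slots[i]) &&
            (best < 0 || slots[i].seq > slots[best].seq)) {
            best = i;
        }
    }
    return best;
}

static void store_write(const uint8_t *payload, bool fail_before_commit)
{
    int cur = store_latest();
    int dst = (cur == 0) ? 1 : 0;       /* never overwrite the live slot */
    record_t *r = &slots[dst];

    assert(payload != NULL);
    memset(r, 0xFF, sizeof *r);         /* erase the target slot */
    r->seq = (cur < 0) ? 1u : slots[cur].seq + 1u;
    memcpy(r->payload, payload, PAYLOAD_LEN);
    r->check = record_check(r->seq, r->payload);   /* stage 1 complete */
    if (fail_before_commit) {
        return;                         /* simulated power loss */
    }
    r->commit = COMMIT_MAGIC;           /* stage 2: record becomes live */
}
```

At boot, a reader would call store_latest and use that slot, falling back to defaults when it returns -1.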

Unreliability of WiFi operation

Reconnection to the WiFi AP sometimes failed after an OTA upgrade. This was caused by a failure to initialise all fields in structures passed to the Espressif API: new fields had been added as Espressif developed the API, and for some of them the zero value left by a default structure initialisation was not the value the device needed.
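The failure mode can be shown with a deliberately hypothetical structure. Nothing below is the real Espressif definition: hyp_station_config_t, its min_rssi field and hyp_ap_acceptable are invented for illustration, on the assumption that a field added in a later SDK release treats 0 as a meaningful (and unhelpful) value. The point is that a single initialisation function sets every field explicitly instead of relying on a zeroed structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical vendor structure; NOT the real Espressif definition. */
typedef struct {
    char    ssid[32];
    char    password[64];
    int8_t  min_rssi;          /* imagined newer field: weakest AP to accept */
    uint8_t scan_all_channels; /* imagined newer field */
} hyp_station_config_t;

#define HYP_MIN_RSSI_DEFAULT (-127)   /* accept any signal strength */

/* Set every field explicitly so that fields added by later SDK releases
 * never silently take the value 0. */
static void hyp_station_config_init(hyp_station_config_t *cfg,
                                    const char *ssid, const char *password)
{
    assert(cfg != NULL && ssid != NULL && password != NULL);
    memset(cfg, 0, sizeof *cfg);
    cfg->min_rssi = HYP_MIN_RSSI_DEFAULT;
    cfg->scan_all_channels = 1u;
    /* Buffers were zeroed above, so copying size-1 bytes keeps them terminated. */
    strncpy(cfg->ssid, ssid, sizeof cfg->ssid - 1u);
    strncpy(cfg->password, password, sizeof cfg->password - 1u);
}

/* Stand-in for the check a vendor stack might perform on a scanned AP. */
static int hyp_ap_acceptable(const hyp_station_config_t *cfg, int8_t ap_rssi)
{
    return ap_rssi >= cfg->min_rssi;
}
```

With a zeroed structure, min_rssi is 0 and an access point at -60 dBm is rejected; after hyp_station_config_init the same access point is accepted.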

Unreliability of user interface control/signalling

The timing of button presses and LED UI sequences was somewhat erratic because updating of the software clock from the hardware timer suffered from various overflow issues. Recoding the hardware clock driver eliminated these errors.

The LED UI driver was rewritten to use a state machine rather than ad-hoc time tests to control its transitions: this made it much easier to understand (and easier to change as the UI was refined).

No watchdog protection

Changes were made to provide hardware watchdog protection for the Control Processor. This required careful integration because the Atmel code was essentially single-threaded: long-duration operations were identified (OTA image checksum generation, image writes to flash, temperature data scanning, etc.) and suitable watchdog reset calls were inserted.
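The pattern of resetting the watchdog inside a long-running loop can be sketched as follows. The function wdt_kick is a hypothetical stand-in for the real hardware watchdog reset call (here it only counts invocations so the behaviour can be observed), and WDT_KICK_INTERVAL and the additive checksum are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WDT_KICK_INTERVAL 256u   /* bytes processed between resets (assumed) */

static unsigned wdt_kick_count;  /* instrumentation for this sketch only */

/* Hypothetical stand-in for the hardware watchdog reset call. */
static void wdt_kick(void)
{
    wdt_kick_count++;
}

/* Additive checksum over an image, resetting the watchdog periodically so
 * that a long scan does not trigger a spurious reset. */
static uint32_t image_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0u;
    assert(data != NULL || len == 0u);
    for (size_t i = 0; i < len; i++) {
        if ((i % WDT_KICK_INTERVAL) == 0u) {
            wdt_kick();
        }
        sum += data[i];
    }
    return sum;
}
```

Placing the reset inside the loop, rather than in a timer interrupt, preserves the protection: if the loop itself hangs, the watchdog still fires.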

Poor behaviour on bad WiFi connections

Testing used a Raspberry Pi as a WiFi access point running the 'slow' script written by Richard Bullington-McGuire (https://gist.github.com/obscurerichard/3740206), which allowed the simulation of network links with varying degrees of packet loss and low throughput. Using this technique, the behaviour of the network stack running on the ESP8266 was tuned to be more resilient.