
Memory Safety in Mission-Critical Embedded Software - Part 1

How memory management can turn into real safety hazards 


In mission-critical embedded systems (think automotive ECUs, avionics controllers, medical devices, industrial safety PLCs), memory safety is not a “nice-to-have.” It’s a reliability and safety requirement. A single memory bug can cause a watchdog reset, corrupt sensor readings, flip a state machine into the wrong mode, or silently degrade behavior over time. In safety terms, that means a software defect can become a hazard: an unintended acceleration event, an incorrect infusion rate, a mis-triggered braking command, or a failure to detect a dangerous condition.


To understand why memory safety failures are so dangerous, you need to see memory management not as an “implementation detail” but as a source of nondeterminism, because nondeterminism is the enemy of safety and real-time guarantees.


Why memory safety is different in embedded safety systems


Most mission-critical embedded systems share a few basic expectations:


  • Predictable behavior: If the inputs are the same, the system should behave the same way every time.

  • On-time response: The system must respond fast enough. Being late can be just as dangerous as being wrong.

  • Limited memory: These devices have a small, fixed amount of memory. If it’s wasted or misused, the system can fail.

  • Long run times: Many systems run continuously for long periods. Small problems can build up and eventually cause failure.


Because of this, how software uses memory matters a lot. A memory issue that might only crash a phone app can lead to a real safety event in a mission-critical system.


Static vs dynamic memory allocation: predictability vs flexibility 


Static allocation (compile-time / fixed memory) 


With static allocation, the system sets aside the needed memory ahead of time, before the software starts running. This usually means using fixed-size memory blocks and buffers that are planned during design.
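
In C, this typically looks like fixed-size arrays and buffers whose sizes are decided during design. The sketch below is purely illustrative (the names and sizes are made up, not from any particular system):

#include <stdint.h>

/* Illustrative sizes only: fixed at design time from a worst-case analysis,
 * never changed at runtime. */
#define MAX_CAN_FRAMES   32u
#define RX_BUFFER_BYTES  256u

typedef struct {
    uint32_t id;
    uint8_t  dlc;
    uint8_t  data[8];
} can_frame_t;

/* All memory is reserved at compile/link time, so the RAM footprint is known
 * before the software ever runs and shows up directly in the linker map. */
static can_frame_t g_frame_queue[MAX_CAN_FRAMES];
static uint8_t     g_rx_buffer[RX_BUFFER_BYTES];

Because every buffer appears in the linker map, the total RAM budget can be reviewed before the system is ever powered on.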


Why teams like it (Pros)

  1. More predictable: the system isn’t “searching for memory” while running.

  2. More stable over time: memory doesn’t get broken into unusable pieces as the system operates.

  3. Easier to review and justify: memory needs can be estimated and checked early.

  4. Fits safety certification well: predictable behavior is easier to argue and verify.


Trade-offs (Cons)

  1. Less flexible: you may reserve more memory than you actually use.

  2. Sizing mistakes hurt: guess too small and you risk overflows or dropped data; guess too big and you waste scarce RAM.

  3. Not ideal for highly variable situations: if memory needs to change a lot at runtime, static planning takes more effort.


Static allocation is favored in many safety standards and embedded guidelines because it reduces runtime uncertainty, which is especially important in hard real-time tasks.


Dynamic allocation (runtime / heap-based: malloc/new) 


With dynamic allocation, the software asks for memory during operation, only when it needs it (for example, creating objects or storing data temporarily). This is the “request memory now” style used in many applications.
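
A minimal sketch of the same idea with runtime allocation (the function and its error codes are illustrative, not from a specific system). The buffer is requested only when a message actually arrives, and the request can fail at runtime:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: the working buffer is requested when data arrives
 * rather than being reserved up front. The malloc() call takes a variable
 * amount of time and can return NULL once the heap is fragmented or
 * exhausted, so the failure path must be handled at runtime. */
int process_message(const uint8_t *payload, size_t len)
{
    uint8_t *work = malloc(len);        /* runtime memory request */
    if (work == NULL) {
        return -1;                      /* allocation failed mid-operation */
    }

    memcpy(work, payload, len);
    /* ... process the data ... */

    free(work);                         /* caller owns the lifetime */
    return 0;
}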

 

Why teams use it (Pros)

  1. Flexible: memory can be requested only when needed.

  2. Good for changing workloads: useful when data sizes vary a lot.

  3. Can reduce upfront memory reservation: you don’t always need to size everything for the worst case.


Why it’s risky for safety (Cons)

  1. Unpredictable timing: asking for memory can take longer sometimes, which can affect response time.

  2. Memory can get “chopped up”: over time, memory may be available in small pieces that aren’t usable when you need a larger block (see the sketch after this list).

  3. Can fail unexpectedly: memory requests might fail after the system has been running for a long time.

  4. Harder to control lifecycle: it increases the chances of errors like memory leaks or using memory after it has been released.
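
To make the “chopped up” problem concrete, here is a toy sketch of a 10-block pool: after freeing every other block, half the pool is free, yet no two contiguous blocks are available, so a larger request still fails. A real heap shows the same effect at byte granularity.

#include <stdbool.h>
#include <stdio.h>

/* Toy fragmentation illustration: 10 blocks, each either used or free. */
#define POOL_BLOCKS 10

static bool used[POOL_BLOCKS];

/* Return true if 'n' contiguous free blocks exist anywhere in the pool. */
static bool has_contiguous(int n)
{
    int run = 0;
    for (int i = 0; i < POOL_BLOCKS; i++) {
        run = used[i] ? 0 : run + 1;
        if (run == n)
            return true;
    }
    return false;
}

int main(void)
{
    for (int i = 0; i < POOL_BLOCKS; i++)
        used[i] = true;                  /* pool fully allocated */
    for (int i = 0; i < POOL_BLOCKS; i += 2)
        used[i] = false;                 /* free every other block */

    /* 5 blocks are free in total, but no 2-block request can be satisfied. */
    printf("2 contiguous blocks available: %s\n",
           has_contiguous(2) ? "yes" : "no");    /* prints "no" */
    return 0;
}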


Fig: Comparison of static vs dynamic memory allocation

Because mission-critical systems depend on predictable behavior, many safety-focused designs either avoid dynamic allocation during normal operation or allow it only in tightly controlled ways (for example, during startup).
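
One common way to keep that control is an allocate-only-at-startup rule. The sketch below (with made-up function names, not a standard API) refuses any allocation request once initialization has completed, so the heap cannot change shape during normal operation:

#include <stdbool.h>
#include <stdlib.h>

static bool g_init_done = false;

/* Allocation wrapper used instead of calling malloc() directly.
 * Illustrative pattern: requests are refused after startup completes. */
void *safe_alloc(size_t bytes)
{
    if (g_init_done) {
        return NULL;    /* or raise a fault: late allocation is a design error */
    }
    return malloc(bytes);
}

void system_init_complete(void)
{
    g_init_done = true; /* from here on, the memory layout is frozen */
}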


Common memory hazards: how they become safety issues 


  1. Memory leaks: the slow-burn failure 


    A memory leak happens when the software allocates memory but never releases it. In embedded systems with limited RAM and long uptimes, leaks are particularly dangerous because they cause resource depletion over time. 


    How leaks become hazards


    Imagine an embedded system that processes frames from a 1-megapixel camera for a safety function (driver monitoring, obstacle detection, machine guarding, etc.).


    • 1 MP frame = ~1,000,000 pixels

    • Each pixel is RGB (24-bit) = 3 bytes per pixel

    • So one frame is roughly 3 MB of raw image data (1,000,000 × 3 bytes)


    Now suppose the software allocates a buffer for each frame (or for intermediate steps like resizing, filtering, or object detection), but due to a bug, it doesn’t fully free some of that memory each cycle. Even a small leak adds up quickly in continuous operation.
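
    A sketch of how such a leak can hide in the processing code (the helper functions are hypothetical): the intermediate buffer is released on the normal path, but an early-return error path skips the free.

    #include <stdint.h>
    #include <stdlib.h>

    #define FRAME_BYTES (1000000u * 3u)      /* ~3 MB of raw 24-bit RGB data */

    /* Hypothetical pipeline steps assumed to exist elsewhere. */
    extern int downscale(const uint8_t *src, uint8_t *dst);
    extern int detect_objects(const uint8_t *img);

    int process_frame(const uint8_t *frame)
    {
        uint8_t *scaled = malloc(FRAME_BYTES / 4);   /* intermediate buffer */
        if (scaled == NULL)
            return -1;

        if (downscale(frame, scaled) != 0)
            return -2;          /* BUG: this early return skips free(scaled) */

        int result = detect_objects(scaled);
        free(scaled);           /* only the happy path releases the buffer */
        return result;
    }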


    For example:

    • If the leak is only 50 KB per frame, then:

      • At 10 frames per second, that’s 500 KB per second

      • That’s ~30 MB per minute

      • In a short time, the system runs low on memory


    What happens next is the safety concern: When memory gets tight, the system may slow down, start dropping frames, miss timing deadlines, or reset unexpectedly. In a safety function, that can mean the system becomes late, unstable, or temporarily blind, exactly when you need it to behave predictably.


    This is why memory leaks are especially dangerous in embedded safety systems: they can look harmless in short testing but become a time-delayed failure in real field operation.


  2. Buffer overflows: the fast catastrophic failure 


    A buffer overflow occurs when code writes beyond the bounds of an array or buffer. This can corrupt adjacent memory: stack frames, function pointers, control data, state variables. 


    Why this is terrifying in safety systems


    Imagine an automotive ECU that handles a radar-based Automatic Emergency Braking (AEB) function. Every cycle, it receives a small “object list” message: distance, relative speed, object ID, confidence, etc. To keep it simple and fast, the software stores this data in a fixed-size buffer - say it was designed to hold up to 16 detected objects.

    Now picture a real-world edge case: heavy rain, reflections, or a complex scene causes the sensor to report more than 16 objects (or a malformed message reports the wrong count). If the software does not strictly enforce bounds, the ECU may write object 17, 18, etc. past the end of the buffer.


    Here’s the safety problem: the overflow doesn’t always cause an obvious crash. Instead, it can overwrite whatever sits next in memory, often something important like:


    • the current braking mode (standby vs active)

    • a threshold value used to decide whether a collision is imminent

    • a confidence flag that marks sensor data as valid/invalid

    • a state machine variable (“monitoring”, “warning”, “braking”)


    So, the system may keep running, but with silently corrupted internal state.


    What that looks like in the vehicle:

    • The ECU may fail to brake when it should (missed hazard), or

    • It may brake unexpectedly (false positive), or

    • It may switch modes incorrectly (e.g., “all clear” when it isn’t)


    In safety engineering terms, this is dangerous because it’s hard to detect, hard to reproduce, and can directly change control decisions. A single off-by-one error, writing just one extra element, can be enough to flip a flag or alter a threshold and change system behavior at exactly the wrong moment.


  3. Use-after-free and double-free: undefined behavior in disguise 


    Consider a medical ventilator (or any life-support device) where one software module reads sensor data (airway pressure, flow rate, oxygen concentration) and another module displays trends and raises alarms. To pass data between modules, the system may use a shared “data packet” structure in memory.


    Now imagine a bug in the handoff: the sensor "task" creates a data packet and publishes it to a queue, and the display/alarm "task" reads the packet. But due to a timing error, one of two things can happen (a simplified code sketch of this handoff follows the list):


    • Use-after-free: the sensor "task" frees the data packet too early and the memory is reused for something else. The display/alarm "task" then reads the packet after it has already been freed and repurposed, so it is no longer reading the original data; it is reading whatever happens to occupy that memory now.


    • Double-free: the receiver "task" (the display/alarm task) frees the memory after reading the data, and then the sender "task" (the sensor task) frees it again, so the same block is released twice. This corrupts the allocator's bookkeeping, and the same memory can later be handed out to two different owners at once.
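
    A minimal sketch of this handoff bug (the queue functions and the sensor/alarm helpers are assumed names, not a specific RTOS API):

    #include <stdlib.h>

    /* Hypothetical shared packet handed from the sensor task to the
     * display/alarm task through a queue. */
    typedef struct {
        float airway_pressure;
        float flow_rate;
        float o2_concentration;
    } sensor_packet_t;

    /* Assumed queue and I/O helpers, for illustration only. */
    extern void queue_send(sensor_packet_t *pkt);
    extern sensor_packet_t *queue_receive(void);
    extern float read_pressure(void);
    extern void check_alarms(float airway_pressure);

    void sensor_task_step(void)
    {
        sensor_packet_t *pkt = malloc(sizeof *pkt);
        if (pkt == NULL)
            return;

        pkt->airway_pressure = read_pressure();   /* flow, O2 filled similarly */
        queue_send(pkt);

        free(pkt);   /* BUG: freed while the receiver may still be reading it,
                        so the receiver can see recycled, unrelated data */
    }

    void display_task_step(void)
    {
        sensor_packet_t *pkt = queue_receive();
        check_alarms(pkt->airway_pressure);
        free(pkt);   /* BUG: if the sender also frees it, the same block is
                        released twice (double-free) */
    }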


    Why this becomes a safety issue: Most of the time, it works fine, especially in a lab where timing is stable. But in the field (temperature changes, CPU load, different sensor noise patterns), the exact timing can shift. Once in a while, the alarm task reads “pressure = 0” or a wildly incorrect value because the memory now contains unrelated data.


    What that looks like operationally:

    • A false alarm (distracts clinicians / triggers unnecessary intervention), or worse

    • A missed alarm (dangerous condition not detected), or

    • A sporadic reset/fault that appears “random”


    This is what makes use-after-free and double-free bugs so difficult: they often depend on rare timing windows and may only show up after long runtimes or under specific loads, meaning they can escape testing and become intermittent, hard-to-diagnose field failures.


How to solve them: practical defenses that work - coming soon

