A Heisenbug in the Sieve

In Castle, one can dynamically connect components and send events over those connections. Typically, this is an action on an incoming message (see: The Actor Model). And depending on ‘The Machinery (ToDo)’, those events can be queued. It is this combination that can result in a beautiful Heisenbug.

First, let’s explain the Heisenbug, before we give an example. Then we analyze it, show how to improve the code, and finally formulate a requirement to prevent & detect this kind of bug in Castle.

What is a Heisenbug?

The heisenbug is named after the theoretical physics Werner Heisenberg, who described the “observer effect”: when you look closely, the behaviour changes. The same can happen to software (bugs). The behaviour apparently changes when you study -or slightly adjust- that code. Often, this is due to small changes in timing, possibly even in generated code. Therefore old (old-fashioned) sequential code on slow CPUs is less likely to have heisenbugs than concurrent code on fast multi-core systems. It’s also common in threaded programs.

The Sieve goes wrong

My standard example, The Sieve (basic variant), suffered from this issue. The initial version did work for years but failed horribly when another “machinery” was used. After studying this, the bug is simple and easy to fix.

There are two related timing issues that together result in the Heisenbug. First, we introduce them and then show how the combination may fail.

Event-order

Conceptually, the Generator sends (events with) integers to Sieve(2), which may forward them to Sieve(3), then to Sieve(5), etc. As shown in the Conceptual sidebar, we probably assume that each integer is sieved before the next ’int’ starts: the classic “sequential view” we are used to.

However, that isn’t how it works. In Castle, only the order of events on a single connection is defined (one by one, sequential). And given the code, the integer sent by a Sieve is always later than the incoming one. That is all we may assume.
The timing of unrelated events on multiple connections is not defined. That order may depend on The Machinery (ToDo) and many other factors. Do not assume any order – as I did!

The diagram in the One-by-One sidebar shows another (allowed) order. The Generator outputs all events first before Sieve(2) starts filtering. Then, the odd integers are passed to Sieve(3), which processes all its input, then Sieve(5), etc.
Although we aren’t using concurrency, and it needs tremendous buffers when finding big primes, it does conceptually work. And so, it is an allowed execution [2].

As that order is “possible” we should have a design (and code) that handles it. But it doesn’t … (in the basic variant).

Reconnecting

The chain of Sieves will grow as we find more primes. Whenever an int isn’t filtered out and reaches the Finder a new prime is found. Then, a new Sieve element is created and inserted into the chain.
Building the chain is done by the Main component (which is signalled by the Finder) [1].

Therefore, Main remembers the last_sieve and reconnects that output to the newly created Sieve. Which output is temporally connected to the Finder
For every newly found prime, this is repeated.

This detail is shown in the “with Details” sidebar diagram; where the Finder and Main and all its messages are shown too.

Assuming the initial “conceptual” order, you will see the same Sieve(s) become alive (“new” message), and are added to the end of the sieve chain. The integers still flow (now, shown as “try(int)” messages) by this sieve.
You will also notice the Finder does indeed find all primes.

The combination

Now let us study how the sieve chain will grow with a “fast generator”, and the one-by-one order of events is used. This diagram is shown below.

As we can see in the picture, it goes dreadfully wrong. No proper chain is created, and we will find integers like 4 and 6. This is wrong, they are not prime.
With a (very) fast Generator, all integers are sent to the Finder –before any Sieve is created, and so any int is reported as prime. besides, too many elements are added to the chain as a Sieve component is created for each found “prime”. On top of that, no integer is ever sieved…

This is just an example. As we can’t predict (or assume) any order, we can find other results too. And, when we add “debugging print statement” (and look closer), we change the timing and will find other results. We found the observer effect!

' (C) 2023,2024 Albert Mietus. Part of CCastle project
@startuml
'hide footbox
title Wrong (Generate fast & Slow creation)

participant  Main        as  M
participant  Finder      as  F
participant "Generator"  as  G
participant "Sieve(2)"   as S2
participant "Sieve(3)"   as S3
participant "Sieve(4)"   as S4 #Red\IndianRed
participant "Sieve(5)"   as S5
participant "Sieve(6)"   as S6 #Red\IndianRed
participant "Sieve(7)"   as S7

... Suppose, the Generator is active first ...

G -[#green]\\   F: try(2)
G -[#purple]\\  F: try(3)
G -[#purple]\\  F: try(4)
G -[#purple]\\  F: try(5)
G -[#purple]\\  F: try(6)
G -[#purple]\\  F: try(7)

... And then the Finder ...

F -[#green]\  M: found(2)
F -[#purple]\ M: found(3)
note right: Despite 3 is a prime, it never passed //Sieve(2//. <color:red>Wrong</color>
F -[#red]\    M: found(4)
note right  #IndianRed/Red: Found prime: <color:white>**4**</color> -- wrong result
F -[#purple]\ M: found(5)
note right: Again, 5 is a prime, but it's **not** //sieved//. <color:red>Wrong</color>
F -[#red]\    M: found(6)
note right  #IndianRed/Red: Found prime: <color:white>**6**</color> -- wrong result
F -[#purple]\ M: found(7)
note right: Again, 7 is a prime, but .... <color:red>wrong</color>

... And last: we create the  Sieves & reconnect ...

M -\ S2: new
activate S2
G \\--o M: "reconneced to Sieve(2)"
hnote right: This reconnect is to late, as sereval messages are already sent


M -[#purple]\ S3: new
activate S3
S2 \\--o M: "reconneced to Sieve(3)"
hnote right: This reconnect is to late, as sereval messages are already sent


M -[#red]\ S4: new
activate S4
note right: Another flaw: We shouldn't have **Sieve(4)**
S3 \\--o M: "reconneced to Sieve(4)"


M -[#purple]\ S5: new
activate S5
hnote right: Sieve(5) never gets input; or at least to little
S4 \\--o M: "reconneced to Sieve(5)"


M -[#red]\ S6: new
activate S6
note right: Another flaw: We shouldn't have **Sieve(6)**
S5 \\--o M: "reconneced to Sieve(6)"


M -[#red]\ S7: new
activate S7
S6 \\--o M: "reconneced to Sieve(6)"
hnote right: Again: too late

== etc ==

@enduml

Warning

It is not “the timing” that is wrong!
A concurrent program is only correct when it goes right for any possible timing.

As in all software engineering, we can prove it is buggy when it goes wrong at least once. That is what we have shown. And so, the original code is wrong.

How to improve?

Finding this heisenbug triggered an investigation to improve Castle as well as ‘The Sieve (basic variant)’. Where our goal is not just to improve the sieve code, but all programs. And give the programmer options to prevent heisenbugs.

As we will see fixing ‘the sieve is simple: use the SLowStart (base) protocol. It changes only three lines!
But we start with a new requirement, for a new tool in the Castle Workshop.

Simulation

As we will see below the SlowStart protocol will remove our Heisenbug. But it does not abolish the Heisenbug!

The feature does by no means prevent a developer from using other solutions, with a comparable flaw. Besides, it’s hopeless to use testing to prove the absence of a Heisenbug. As we have seen above, all variants of all possible concurrent execution orders should be correct. Where we have very limited control over which variant is used.

This led to a new requirement: Simulate & Verify all possi... (U_Tools_EventOrder.) To assist the developer the Castle Workshop will come with tools to detect such bugs. See Simulation/Verification (todo) for more on this.

SlowStart

Castle (now) comes [3] with the parametric base protocol SlowStart, which is based on TCP Slow start and contains a queueing model [4] to control the speed of an (event) connection. As the name suggests, initially the connections with be slow. Roughly, the parameter set the maximal number of unhandled events on a connection.
The (improved) version of The Sieve (basic variant) uses a SlowStart of 1. And Main will remove (or increase) that limit after reconnecting.

Initially, the Generator is only allowed to send one event, which is received by the Finder. Then, Main will create the first Sieve (Sieve(2)), reconnect the Generator to that component, and increase the “speed” of the connection. As the connection Generator->Sieve(2) is stable, the is no need to limit the “queue”.

The Generator acts as before: it sends (events with) integers over its output. But now, the SimpleSieve protocol can slow down the Generator –no code changes are needed for this. This may happen when the second event is sent before it is received (or “handled”) by the Finder, and the limit is set to 1.
As this limit is removed when a Sieve-component is inserted into the chain, only the start is slow…

The same happens to every Sieve: initially (when it is connected to the Finder) there is a limit of 1 event in the queue. But when that connection is reconnected –and the potential Heisenbug is gone– the limit is removed.

Tip

Unlimited or better?

In this blog, we remove the limit of the SlowStart protocol completely, for simplicity. Then the Heisenbug is solved.

That is not the only option.
Given the Generator is a simple loop it can produce many integers fast. And so causes huge piles of queued events. That can be done better, by the same concept: a maximal queue size. (again: just model).

It’s up to the developer to optimize this. Some prefer the maximum queue length equal to the number of Sieve-components, others relate the maximum queue length to the number of available cores. Some use static values, others will adjust them over the run-time of the application.
That is all possible with a few lines in the Main component. But also the Sieve component can set this limit, both for incoming ports as well for outgoing ports.


Footnotes

Comments

comments powered by Disqus