Eventhub triggered Azure function: Replays and Retries

Shervyna Ruan
Oct 28, 2020

In our project, we were investigating the load test results of our eventhub triggered azure functions. We had some performance issues, which led us to wonder whether azure function has some retry or replay logic under the hood that made the function process more load than expected. Although it turned out that neither retry nor replay logic was causing the issue, I would still like to share what we learned about how eventhub triggered azure function behaves in terms of replays and retries.

Content:

  1. Conclusions first :)
  2. Where does azure function store the checkpoints of events?
  3. Replays:
    - How does BatchCheckpointFrequency affect replay behaviors?
  4. Retries:
    - Does eventhub triggered azure function retry on failure?
    - How to implement your own retry policy?

1. Conclusions first :)

Eventhub triggered azure function does not have any retry logic by default. Even if an error happens while processing an event, azure function will still mark that event as processed (if batchCheckpointFrequency is 1), so you might lose some of your events if an unexpected error happens in the middle of processing a batch. Setting batchCheckpointFrequency to a value larger than 1, however, can help in the case of app crashes: when azure function restarts, it replays the batches that were processed after the last checkpoint.

During development, it is probably better to leave batchCheckpointFrequency at 1 to avoid the confusion that replays may cause. If you wonder why you keep getting the same events again and again whenever you restart your function, check batchCheckpointFrequency in host.json. In production, setting batchCheckpointFrequency to a value larger than 1 can help reduce data loss, and can potentially improve performance by making fewer calls to update the checkpoints in storage.

2. Where does azure function store the checkpoints of events?

Go to the storage account that you set for AzureWebJobsStorage in the function app settings -> Blob containers -> azure-webjobs-eventhub -> your eventhub namespace -> your eventhub -> your consumer group. There, you will see a list of files that store the checkpoint information for each partition.

Checkpoint files for my event hub with 2 partitions.

Click on edit and you can see that a checkpoint looks like the following:

{"Offset":"672","SequenceNumber":4,"PartitionId":"0","Owner":"xxxxxxxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx","Token":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","Epoch":4}

Try sending some events to your eventhub and you will see that SequenceNumber increases as azure function consumes the messages (if batchCheckpointFrequency is 1).

https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.eventhubs.processor.lease?view=azure-dotnet-legacy#properties
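If you would rather inspect checkpoints from a script than from the portal, the blob content is plain JSON. Here is a minimal Python sketch, assuming you have already downloaded the blob content as a string. The sample values are copied from the checkpoint above with shortened placeholder Owner and Token, and `summarize_checkpoint` is a made-up helper name, not part of any SDK:

```python
import json

# Sample checkpoint content as shown above (Owner/Token are placeholders).
raw = ('{"Offset":"672","SequenceNumber":4,"PartitionId":"0",'
       '"Owner":"xxxx","Token":"xxxx","Epoch":4}')


def summarize_checkpoint(blob_text: str) -> dict:
    """Extract the fields that tell you how far a partition has been read."""
    lease = json.loads(blob_text)
    return {
        "partition": lease["PartitionId"],
        "sequence_number": lease["SequenceNumber"],
        "offset": int(lease["Offset"]),  # Offset is stored as a string
    }


summary = summarize_checkpoint(raw)
print(summary)
```

Comparing `sequence_number` before and after sending events is a quick way to confirm whether a checkpoint was actually written.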

3. Replays:

How does BatchCheckpointFrequency affect replay behaviors?

[Def.] BatchCheckpointFrequency: The number of event batches to process before creating an EventHub cursor checkpoint. Default = 1.

A replay happens when azure function has processed an event but the checkpoint has not been updated. When azure function restarts, it reads from the previous checkpoint and processes the same events again.

Let’s say you have this setting in host.json, where batchCheckpointFrequency is set to 2 and maxBatchSize is set to 1 (so that azure function will only have one event in each batch):

...
"extensions": {
    "eventHubs": {
        "batchCheckpointFrequency": 2,
        "eventProcessorOptions": {
            "maxBatchSize": 1,
            "prefetchCount": 10
        }
    }
},
...

Then, you send 4 events to event hub. Azure function will retrieve one event at a time (maxBatchSize = 1). In total, azure function will process 4 batches, and you might think that since batchCheckpointFrequency = 2, all events should be marked as processed. However, batchCheckpointFrequency creates a checkpoint after every N batches per partition. Events are sent to the partitions in a round robin fashion if no partition key is specified, so with one possible distribution (just an example), only event 0 and event 3 will be marked as processed. When you restart your azure function, you will then see events 1 and 2 being replayed.
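The per-partition counting is easier to see in a small simulation. The Python sketch below is a toy model, not the Functions runtime: `simulate` is a made-up function, and the partition assignment is a hypothetical one chosen only to reproduce the outcome described above (events 0 and 3 share a partition, events 1 and 2 each land alone in another one).

```python
from collections import defaultdict


def simulate(assignments, batch_checkpoint_frequency):
    """assignments: list of (event_id, partition_id) in arrival order.

    Each event is its own batch (maxBatchSize = 1). Returns the events
    covered by a checkpoint and the events replayed after a restart.
    """
    batches_since_checkpoint = defaultdict(int)
    pending = defaultdict(list)  # processed, but not yet checkpointed
    checkpointed = []

    for event_id, partition in assignments:
        pending[partition].append(event_id)  # the function processes the batch
        batches_since_checkpoint[partition] += 1
        if batches_since_checkpoint[partition] == batch_checkpoint_frequency:
            # The checkpoint covers everything processed so far on this partition.
            checkpointed.extend(pending[partition])
            pending[partition].clear()
            batches_since_checkpoint[partition] = 0

    replayed = sorted(e for events in pending.values() for e in events)
    return sorted(checkpointed), replayed


# Hypothetical distribution of the 4 events (an assumption for illustration).
assignments = [(0, "p0"), (1, "p1"), (2, "p2"), (3, "p0")]
checkpointed, replayed = simulate(assignments, batch_checkpoint_frequency=2)
print("checkpointed:", checkpointed)         # checkpointed: [0, 3]
print("replayed on restart:", replayed)      # replayed on restart: [1, 2]
```

With `batch_checkpoint_frequency=1`, the same input leaves nothing to replay, because every batch is checkpointed as soon as it has been processed.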

Let’s try sending more events in one batch, and increase the maxBatchSize:

...
"extensions": {
    "eventHubs": {
        "batchCheckpointFrequency": 2,
        "eventProcessorOptions": {
            "maxBatchSize": 10,
            "prefetchCount": 10
        }
    }
},
...

The current checkpoint of partition 0 is sequenceNumber:58. This time, I will send 10 events in a batch to eventhub.

Notice that the number of events sent as a batch to event hub is not necessarily the same as the number of events that azure function receives in a batch. I sent 10 events as a batch to eventhub, and even with maxBatchSize set to 10, azure function processed them in 3 batches.

Since all 10 events are sent to partition 0 as a batch and batchCheckpointFrequency is set to 2, azure function will mark the first two batches as processed (up to event 65) in storage:

If you restart your function, you will see that the last batch (events 66 to 68) is replayed.

Therefore, when batchCheckpointFrequency is not set to 1, you might see some replay behaviors when you restart the function.
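The arithmetic of this run can be checked with a few lines of Python. This is only a sketch: `checkpoint_after` is a made-up helper, and the split between the first two batches is assumed, since all we know from the run is that the first two batches ended at event 65 and the last one covered events 66 to 68.

```python
def checkpoint_after(batches, batch_checkpoint_frequency, start_checkpoint):
    """Return (checkpoint, replayed) for a single partition.

    batches: lists of sequence numbers, in the order the function received them.
    A checkpoint is written after every N-th batch and records the last
    sequence number of that batch.
    """
    checkpoint = start_checkpoint
    for index, batch in enumerate(batches, start=1):
        if index % batch_checkpoint_frequency == 0:
            checkpoint = batch[-1]
    last_processed = batches[-1][-1] if batches else start_checkpoint
    # Everything processed after the last checkpoint is read again on restart.
    replayed = list(range(checkpoint + 1, last_processed + 1))
    return checkpoint, replayed


# Events 59..68; the boundary between the first two batches is an assumption.
batches = [[59, 60, 61], [62, 63, 64, 65], [66, 67, 68]]

checkpoint, replayed = checkpoint_after(batches, 2, start_checkpoint=58)
print(checkpoint, replayed)  # 65 [66, 67, 68]

checkpoint_1, replayed_1 = checkpoint_after(batches, 1, start_checkpoint=58)
print(checkpoint_1, replayed_1)  # 68 []
```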

4. Retries:

Does eventhub triggered azure function retry on failure?

The answer is no (as of Oct 26th, 2020). However, it looks like the retry feature is in a PR, so this could change at any time.

When errors occur in a stream, if you decide to keep the pointer in the same spot, event processing is blocked until the pointer is advanced. In other words, if the pointer is stopped to deal with problems processing a single event, the unprocessed events begin piling up.

Azure Functions avoids deadlocks by advancing the stream’s pointer regardless of success or failure. Since the pointer keeps advancing, your functions need to deal with failures appropriately.

Ref: https://docs.microsoft.com/en-us/azure/azure-functions/functions-reliable-event-processing
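To make the consequence concrete, here is a toy model in Python of a pointer that advances regardless of the outcome. It is not the Functions runtime, and `run_without_retry` and `handler` are made-up names. An event whose handler throws is simply skipped:

```python
def run_without_retry(events, handler):
    """Process each event once; the pointer advances even when the handler raises."""
    pointer = 0
    succeeded, lost = [], []
    while pointer < len(events):
        event = events[pointer]
        try:
            handler(event)
            succeeded.append(event)
        except Exception:
            lost.append(event)  # nobody retries this event
        pointer += 1  # advanced regardless of success or failure
    return succeeded, lost


def handler(event):
    # Simulated processing error for one specific event.
    if event == "event-2":
        raise ValueError("simulated processing error")


succeeded, lost = run_without_retry(["event-1", "event-2", "event-3"], handler)
print("succeeded:", succeeded)  # succeeded: ['event-1', 'event-3']
print("lost:", lost)            # lost: ['event-2']
```

This is why the function code itself has to catch errors and decide what to do with a failed event, for example by retrying it.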

How to implement your own retry policy?

To implement your own retry logic, you can look into the Polly library (mentioned in the doc above). Here's a simple example:

using System;
using Polly;

// Retry up to 3 times when any exception is thrown.
var policy = Policy.Handle<Exception>().Retry(3);

policy.Execute(() =>
{
    // some logic
});
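If you are not on .NET, the behavior of the policy above can be sketched in plain Python. This illustrates the retry idea and is not how Polly is implemented; `execute_with_retry` and `flaky` are hypothetical names. Like `Retry(3)`, it allows the initial attempt plus up to three retries, then lets the exception propagate:

```python
def execute_with_retry(action, retries=3):
    """Call action(); on an exception, retry up to `retries` more times, then re-raise."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1


calls = {"count": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = execute_with_retry(flaky)
print(result, calls["count"])  # ok 3
```

Once the retries are exhausted you still need to decide what happens to the event, for example logging it or sending it somewhere for later inspection, because azure function will move on to the next batch either way.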

Link to an example of using Polly in azure functions

That’s it! Thanks for reading :)
