Calculating End-to-End Latency for Eventhub Triggered Azure functions with App Insights

Shervyna Ruan
7 min readNov 10, 2020

--

One of the benchmarks that people are usually interested in their system in terms of performance is the end-to-end latency. In our project, the system is mainly composed of Eventhubs and Event-triggered Azure Functions. To keep it simple, let's assume our system only have 2 event-triggered azure functions.

In a later phase of our project, we were also interested in how long it would take for an event to flow from the start of the system to the end. Our original idea was to add timestamps to every event and log the calculated time difference at the end of the system to App Insights where we can further calculate the average latency with some Kusto queries.

When we looked into App Insights, however, we found some features that “seem like” we can get a hint of the end-to-end latency without extra effort. Let’s look into each of them and see whether that really helps to calculate the latency or not.

Contents:

  1. Can Application Map be used to calculate end-to-end latency?
  2. Can we use the default correlated logs in App Insight to calculate e2e latency?
  3. Calculating latency with your own logs
  4. Conclusion

1. Can Application Map be used to calculate end-to-end latency?

-> No. While Application Map shows the duration of each component, it is not used for calculating end-to-end latency. Application Map is helpful when you want to easily identify the bottleneck of your system.

Application Map

At first sight, it may be tempting to think that we can get the end-to-end latency by just adding up these number of durations. However, if you check the end-to-end transaction detail, you will realize that depending on how you implement the logic, function duration can include the dependency call duration, or even last longer than the second function. Also, “time spent in queue”(third bar in the screenshot) will not be reflected in the Application Map. Therefore, calculating end-to-end latency with Application Map can be tricky since there might be some actual overlaps or latency that are not reflected.

End-to-end Transaction Details

2. Can we use the default correlated logs in App Insight to calculate e2e latency?

-> It depends on the definition of the latency that you want to measure. You can only calculate latency from the start of function execution, instead of event enqueued time with default logs.

You might have noticed already, from screenshots of the Application Map and end-to-end transaction details posted above, that logs are correlated throughout both functions by default. Check out my other post on how this distributed tracing works in the back :)

Here are the logs that are logged by Azure functions by default (2 scenarios):

  1. When the second function processes only one event in that invocation: (The first function and second function will have the same operation Id.)

2. When the second function processes a batch of events in one invocation:(The first function and second function will have different operation id, but there will be an “operation links” column in the second function’s request telemetry. “Operation links” tells you the operation Id of where each event came from.)

Therefore, by using these default logs, you can calculate end-to-end latency by doing “second function end time ー first function start time”. In order to make this calculation, you also need to correlate the logs of the second function and the first function. Here is a sample Kusto query:

let start = view(){
requests
| where operation_Name == "FirstFunction"
| project start_t = timestamp, first_operation_Id = operation_Id
};
let link = view(){
requests
| where operation_Name == "SecondFunction"
| mv-expand ex = parse_json(tostring(customDimensions["_MS.links"]))
| extend parent = case(isnotempty(ex.operation_Id), ex.operation_Id, operation_Id )
| project first_operation_Id = parent, second_operation_Id = operation_Id
};
let finish = view(){
traces
| where customDimensions["EventName"] == "FunctionCompleted" and operation_Name == "SecondFunction"
| project end_t = timestamp, second_operation_Id = operation_Id
};
start
| join kind=inner (
link
| join kind=inner finish on second_operation_Id
) on first_operation_Id
| project start_t, end_t, first_operation_Id, second_operation_Id
| summarize avg(datetime_diff('second', end_t, start_t))

What this Kusto query actually does is that it will expand the operation links(if any) in the second function’s request, and map the second function’s end time to its corresponding first function’s start time. When there are no operation links, it means that the function only processed one event during that invocation, so the first function’s operation id will just be the same as the second function.

In fact, you might also realize that there is an “enqueueTimeUtc” property in the trigger detail trace. You might be wondering, why does this section claim that we can only calculate latency from function start time instead of event enqueue time if the enqueued time is also logged by default? The answer is, this “enqueueTimeUtc” is extracted from only the first event of each of the batch that function processed. So, if the maxbatchsize of your function is set to 1, then you might be able to calculate “second function endsーevent enqueue time”. However, in most cases, it is more efficient to process events in batches, so in those cases, it does not make much sense to use the first event’s enqueue time for the rest of the events in the same batch.

only the first event’s enqueuedTImeUtc is used to set the property

3. Calculating latency with your own logs

While the function start time to function end time’s latency gives you some idea of a part of your system, some projects might need to know more about the latency that is not covered by what was logged by default. For instance, as mentioned in the previous section, you might want to define the end-to-end latency as “second function end ー event enqueue time”. You might even want to define end to end latency based on some business metrics. In those cases, you can include timestamps of the start time in each event, and log the calculated latency of each event at the end of your system.

One thing to remember when doing so is to log the latency with “structured logging”. For example:

double latency = (nowTimeUTC - enqueuedTime).TotalMilliseconds;
log.LogInformation("latency: {latency}", latency);

This way, you will be able to query the property easily in App Insights by using “customDimensions.prop__{yourProperty}”. Beware that there are two underscores in this expression.

Kusto query to calculate the latency based on your logs:

traces
| where operation_Name == "SecondFunction" and isnotempty(customDimensions.prop__latency)
| summarize avg(todouble(customDimensions.prop__latency))

By using Kusto query, you can also draw some charts to gain some insight into how the latency changes from time to time:

In addition, you can use the sampling feature of App Insights to reduce the logs that are sent to App Insights. You might not want to calculate latency for EVERY event if you got hundreds of thousands of events.

4. Conclusion

By using default Azure Function logs in App Insight, you can calculate a part of the system latency without adding extra logs in your code. However, since the default logs only contain the timestamp for function start time, you will still have to log your own calculated latency if you want to caculate latency from a different starting point such as event enqueued time. Anyways, I hope this blog gives you a better understanding of the logs in App Insight, and introduced some of the concepts like kusto query, sampling, structured logging to you.

That’s it! Thanks for reading :)

Related blog: Distributed Tracing Deep Dive for Eventhub-triggered Azure Function in App Insights

--

--

No responses yet