When in high power mode more cores are used, but this consumes more resources when building the ONNX Runtime. To limit resource utilisation --parallel 2 was added to the command line because the compile process was having “out of memory” failures.
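For reference, the build invocation was along these lines; this is a sketch based on the ONNX Runtime Jetson build documentation, and the CUDA, cuDNN and TensorRT paths are illustrative rather than exactly what was on my device.

./build.sh --config Release --update --build --parallel 2 \
   --use_tensorrt \
   --cuda_home /usr/local/cuda \
   --cudnn_home /usr/lib/aarch64-linux-gnu \
   --tensorrt_home /usr/lib/aarch64-linux-gnu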
Getting CUDA and TensorRT working on the Nvidia Jetson Orin 8G took much longer than I expected, with many dead ends and device factory resets before the process was repeatable.
The testing consisted of permutations of three models: TennisBallsYoloV8s20240618640×640.onnx, TennisBallsYoloV8s202406241024×1024.onnx & TennisBallsYoloV8x20240614640×640.onnx (limited testing as it was slow), and three images: TennisBallsLandscape640x640.jpg, TennisBallsLandscape1024x1024.jpg & TennisBallsLandscape3072x4080.jpg.
Executive Summary
As expected, inferencing with a TensorRT 640×640 model and a 640×640 image was fastest, 9mSec pre-processing, 21mSec inferencing, then 4mSec post-processing.
If the image had to be scaled with SixLabors.ImageSharp this significantly increased the preprocessing (and overall) time.
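The scaling step is an ImageSharp Resize; a minimal sketch using the test image file names from above (the ResizeMode choice is illustrative, not necessarily what the demonstration used).

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Processing;

// Scale the camera-resolution test image down to the 640×640 model input size
using (Image&lt;Rgb24&gt; image = await Image.LoadAsync&lt;Rgb24&gt;("TennisBallsLandscape3072x4080.jpg"))
{
   image.Mutate(x => x.Resize(new ResizeOptions()
   {
      Size = new Size(640, 640),
      Mode = ResizeMode.Max // preserve the aspect ratio
   }));

   await image.SaveAsJpegAsync("TennisBallsLandscape640x640.jpg");
}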
When running the YoloV8 Coprocessor demonstration on the Nvidia Jetson Orin inferencing looked a bit odd; the dotted line wasn’t moving as fast as expected. To investigate this further I split the inferencing duration into pre-processing, inferencing and post-processing times. Inferencing and post-processing were “quick”, but pre-processing was taking longer than expected.
YoloV8 Coprocessor application running on Nvidia Jetson Orin
When I ran the demonstration Ultralytics YoloV8 object detection console application on my development desktop (13th Gen Intel(R) Core(TM) i7-13700 2.10 GHz with 32.0 GB) the pre-processing was much faster.
The much shorter pre-processing and longer inferencing durations were not a surprise as my development desktop does not have a Graphics Processing Unit (GPU).
Test image used for testing on Jetson device and development PC
The test image taken with my mobile was 3606×2715 pixels, which was representative of the security camera images to be processed by the solution.
public static void ProcessToTensor(Image&lt;Rgb24&gt; image, Size modelSize, bool originalAspectRatio, DenseTensor&lt;float&gt; target, int batch)
{
   var options = new ResizeOptions()
   {
      Size = modelSize,
      Mode = originalAspectRatio ? ResizeMode.Max : ResizeMode.Stretch,
   };

   // Resize the image to the model input size (the padding calculation below assumes this has been done)
   image.Mutate(x => x.Resize(options));

   var xPadding = (modelSize.Width - image.Width) / 2;
   var yPadding = (modelSize.Height - image.Height) / 2;

   var width = image.Width;
   var height = image.Height;

   // Pre-calculate strides for performance
   var strideBatchR = target.Strides[0] * batch + target.Strides[1] * 0;
   var strideBatchG = target.Strides[0] * batch + target.Strides[1] * 1;
   var strideBatchB = target.Strides[0] * batch + target.Strides[1] * 2;
   var strideY = target.Strides[2];
   var strideX = target.Strides[3];

   // Get a span of the whole tensor for fast access
   var tensorSpan = target.Buffer;

   // Try to get a continuous memory block of the entire image data
   if (image.DangerousTryGetSinglePixelMemory(out var memory))
   {
      Parallel.For(0, width * height, index =>
      {
         int x = index % width;
         int y = index / width;

         int tensorIndex = strideBatchR + strideY * (y + yPadding) + strideX * (x + xPadding);

         var pixel = memory.Span[index];
         WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
      });
   }
   else
   {
      Parallel.For(0, height, y =>
      {
         var rowSpan = image.DangerousGetPixelRowMemory(y).Span;
         int tensorYIndex = strideBatchR + strideY * (y + yPadding);

         for (int x = 0; x < width; x++)
         {
            int tensorIndex = tensorYIndex + strideX * (x + xPadding);

            var pixel = rowSpan[x];
            WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
         }
      });
   }
}
private static void WritePixel(Span&lt;float&gt; tensorSpan, int tensorIndex, Rgb24 pixel, int strideBatchR, int strideBatchG, int strideBatchB)
{
   tensorSpan[tensorIndex] = pixel.R / 255f;
   tensorSpan[tensorIndex + strideBatchG - strideBatchR] = pixel.G / 255f;
   tensorSpan[tensorIndex + strideBatchB - strideBatchR] = pixel.B / 255f;
}
For a 3606×2715 image the WritePixel method is called nearly ten million times, so its implementation and the overall approach used in ProcessToTensor have a significant impact on performance.
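For illustration, a hypothetical call that fills batch 0 of a 1×3×640×640 NCHW input tensor, matching the 640×640 models above.

// Illustrative only - allocate an NCHW input tensor then populate batch 0 from an image
var input = new DenseTensor&lt;float&gt;(new[] { 1, 3, 640, 640 });

using (Image&lt;Rgb24&gt; image = await Image.LoadAsync&lt;Rgb24&gt;("TennisBallsLandscape640x640.jpg"))
{
   ProcessToTensor(image, new Size(640, 640), originalAspectRatio: true, input, batch: 0);
}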
YoloV8 Coprocessor application running on Nvidia Jetson Orin with a resized image
Resizing the images had a significant impact on performance on both the development box and the Nvidia Jetson Orin. This will need some investigation to see how much resizing the images impacts the performance and accuracy of the model.
Generating the TensorRT engine every time the application is started
The TensorRT Execution Provider has a number of configuration options, so the IYoloV8Builder interface had to be modified, with UseCuda, UseRocm, UseTensorrt and UseTvm overloads implemented to allow additional configuration settings.
...
public class YoloV8Builder : IYoloV8Builder
{
   ...

   public IYoloV8Builder UseOnnxModel(BinarySelector model)
   {
      _model = model;

      return this;
   }

#if GPURELEASE
   public IYoloV8Builder UseCuda(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithCudaProvider(deviceId));

   public IYoloV8Builder UseCuda(OrtCUDAProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithCudaProvider(options));

   public IYoloV8Builder UseRocm(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithRocmProvider(deviceId));

   // Couldn't test this - don't have suitable hardware
   public IYoloV8Builder UseRocm(OrtROCMProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithRocmProvider(options));

   public IYoloV8Builder UseTensorrt(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithTensorrtProvider(deviceId));

   public IYoloV8Builder UseTensorrt(OrtTensorRTProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithTensorrtProvider(options));

   // Couldn't test this - don't have suitable hardware
   public IYoloV8Builder UseTvm(string settings = "") => WithSessionOptions(SessionOptions.MakeSessionOptionWithTvmProvider(settings));
#endif

   ...
}
...
YoloV8Builder builder = new YoloV8Builder();

builder.UseOnnxModel(_applicationSettings.ModelPath);

if (_applicationSettings.UseTensorrt)
{
   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Using TensorRT");

   OrtTensorRTProviderOptions tensorRToptions = new OrtTensorRTProviderOptions();

   Dictionary&lt;string, string&gt; optionKeyValuePairs = new Dictionary&lt;string, string&gt;();

   optionKeyValuePairs.Add("trt_engine_cache_enable", "1");
   optionKeyValuePairs.Add("trt_engine_cache_path", "enginecache/");

   tensorRToptions.UpdateOptions(optionKeyValuePairs);

   builder.UseTensorrt(tensorRToptions);
}
...
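The CUDA Execution Provider can be selected the same way with the UseCuda overloads; a sketch, assuming a hypothetical UseCuda application setting (only UseTensorrt appears in the configuration above).

if (_applicationSettings.UseCuda) // hypothetical setting, for illustration
{
   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Using CUDA");

   builder.UseCuda(0); // deviceId 0 - the Jetson Orin has a single integrated GPU
}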
In order to validate that the engine loaded from the trt_engine_cache_path is usable for the current inferencing, an engine profile is also cached and loaded along with the engine. If the current input shapes are in the range of the engine profile, the loaded engine can be safely used. If the input shapes are out of range, the profile will be updated and the engine will be recreated based on the new profile.
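The engine cache settings are only two of the TensorRT Execution Provider keys; the sketch below adds two more that the provider documents (trt_fp16_enable and trt_timing_cache_enable). The values are illustrative rather than what the demonstration application used.

Dictionary&lt;string, string&gt; optionKeyValuePairs = new Dictionary&lt;string, string&gt;()
{
   { "trt_engine_cache_enable", "1" },
   { "trt_engine_cache_path", "enginecache/" },
   { "trt_fp16_enable", "1" },         // half-precision can reduce engine build and inferencing time
   { "trt_timing_cache_enable", "1" }  // reuse kernel timing information across engine builds
};

tensorRToptions.UpdateOptions(optionKeyValuePairs);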
Reusing the TensorRT engine built the first time the application is started
When the YoloV8.Coprocessor.Detect.Image application was configured to use NVIDIA TensorRT and the engine was cached the average inference time was 58mSec and the Build method took roughly 10sec to execute after the application had been run once.
I configured the demonstration Ultralytics YoloV8 object detection (yolov8s.onnx) console application to process a 1920×1080 image from a security camera on my desktop development box (13th Gen Intel(R) Core(TM) i7-13700 2.10 GHz with 32.0 GB).
Object Detection sample application running on my development box
A Seeedstudio reComputer J3011 uses an Nvidia Jetson Orin 8G and looked like a cost-effective platform to explore how a dedicated Artificial Intelligence (AI) co-processor could reduce inferencing times.
To establish a “baseline” I “published” the demonstration application on my development box, which created a folder with all the files required to run the application on the Seeedstudio reComputer J3011 ARM64 CPU. I had to manually merge the “User Secrets” and appsettings.json files so the camera connection configuration was correct.
The runtimes folder contained a number of folders with the native runtime files for the supported Open Neural Network Exchange (ONNX) platforms.
This Nvidia Jetson Orin ARM64 CPU requires the linux-arm64 ONNX runtime, which was “automagically” detected. (In previous versions of ML.NET the native runtime had to be copied to the execution directory.)
Object Detection sample application running on my Seeedstudio reComputer J3011
When I averaged the pre-processing, inferencing and post-processing times for both devices over 20 executions my development box was much faster, which was not a surprise, though the reComputer J3011 post-processing times were a bit faster than I was expecting.
ARM64 CPU: Preprocess 0.05s, Inference 0.31s, Postprocess 0.05s
private static async void ImageUpdateTimerCallback(object state)
{
   DateTime requestAtUtc = DateTime.UtcNow;

   // Just in case - stop code being called while photo already in progress
   if (_cameraBusy)
   {
      return;
   }
   _cameraBusy = true;

   Console.WriteLine($"{DateTime.UtcNow:yy-MM-dd HH:mm:ss} Image processing start");

   try
   {
#if SECURITY_CAMERA
      Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Security Camera Image download start");
      using (Stream cameraStream = await _httpClient.GetStreamAsync(_applicationSettings.CameraUrl))
      using (Stream fileStream = File.Create(_applicationSettings.ImageInputFilenameLocal))
      {
         await cameraStream.CopyToAsync(fileStream);
      }
      Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Security Camera Image download done");
#endif

      List&lt;YoloPrediction&gt; predictions;

      // Process the image on the local file system
      using (Image&lt;Rgba32&gt; image = await Image.LoadAsync&lt;Rgba32&gt;(_applicationSettings.ImageInputFilenameLocal))
      {
         Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} YoloV5 inferencing start");
         predictions = _scorer.Predict(image);
         Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} YoloV5 inferencing done");

#if OUTPUT_IMAGE_MARKUP
         Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Image markup start");

         var font = new Font(new FontCollection().Add(_applicationSettings.ImageOutputMarkupFontPath), _applicationSettings.ImageOutputMarkupFontSize);

         foreach (var prediction in predictions) // iterate predictions to draw results
         {
            double score = Math.Round(prediction.Score, 2);

            var (x, y) = (prediction.Rectangle.Left - 3, prediction.Rectangle.Top - 23);

            image.Mutate(a => a.DrawPolygon(Pens.Solid(prediction.Label.Color, 1),
               new PointF(prediction.Rectangle.Left, prediction.Rectangle.Top),
               new PointF(prediction.Rectangle.Right, prediction.Rectangle.Top),
               new PointF(prediction.Rectangle.Right, prediction.Rectangle.Bottom),
               new PointF(prediction.Rectangle.Left, prediction.Rectangle.Bottom)
            ));

            image.Mutate(a => a.DrawText($"{prediction.Label.Name} ({score})",
               font, prediction.Label.Color, new PointF(x, y)));
         }

         await image.SaveAsJpegAsync(_applicationSettings.ImageOutputFilenameLocal);

         Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Image markup done");
#endif
      }

#if PREDICTION_CLASSES
      Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Image classes start");
      foreach (var prediction in predictions)
      {
         Console.WriteLine($"  Name:{prediction.Label.Name} Score:{prediction.Score:f2} Valid:{prediction.Score > _applicationSettings.PredictionScoreThreshold}");
      }
      Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Image classes done");
#endif

#if PREDICTION_CLASSES_OF_INTEREST
      IEnumerable&lt;string&gt; predictionsOfInterest = predictions.Where(p => p.Score > _applicationSettings.PredictionScoreThreshold).Select(c => c.Label.Name).Intersect(_applicationSettings.PredictionLabelsOfInterest, StringComparer.OrdinalIgnoreCase);

      if (predictionsOfInterest.Any())
      {
         Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss} Camera image contains {String.Join(",", predictionsOfInterest)}");
      }
#endif
   }
   catch (Exception ex)
   {
      Console.WriteLine($"{DateTime.UtcNow:yy-MM-dd HH:mm:ss} Camera image download, upload or post processing failed {ex.Message}");
   }
   finally
   {
      _cameraBusy = false;
   }

   TimeSpan duration = DateTime.UtcNow - requestAtUtc;

   Console.WriteLine($"{DateTime.UtcNow:yy-MM-dd HH:mm:ss} Image processing done {duration.TotalSeconds:f2} sec");
   Console.WriteLine();
}
The names of the input image, output image and YoloV5 model file are configured in the appsettings.json (on device) or secrets.json (Visual Studio 2022 desktop) file. The location (ImageOutputMarkupFontPath) and size (ImageOutputMarkupFontSize) of the font used are configurable to make it easier to run the application on different devices and operating systems.
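A minimal sketch of the matching appsettings.json entries; the property names come from the code above, but the nesting and values shown are illustrative rather than the actual configuration file.

{
   "ApplicationSettings": {
      "CameraUrl": "http://192.168.1.65/snapshot.jpg",
      "ImageInputFilenameLocal": "InputLatest.jpg",
      "ImageOutputFilenameLocal": "OutputLatest.jpg",
      "ImageOutputMarkupFontPath": "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
      "ImageOutputMarkupFontSize": 20,
      "PredictionScoreThreshold": 0.5
   }
}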
This post builds on my Smartish Edge Camera – Azure IoT Direct Methods post, adding two updateable properties for the image capture and processing timer: the due and period values. The two properties can be updated together or independently, but the values are not persisted.
When I was searching for answers I found this code in many posts and articles but it didn’t really cover my scenario.
private static async Task OnDesiredPropertyChanged(TwinCollection desiredProperties, object userContext)
{
   Console.WriteLine("desired property change:");
   Console.WriteLine(JsonConvert.SerializeObject(desiredProperties));

   Console.WriteLine("Sending current time as reported property");
   TwinCollection reportedProperties = new TwinCollection
   {
      ["DateTimeLastDesiredPropertyChangeReceived"] = DateTime.Now
   };

   await Client.UpdateReportedPropertiesAsync(reportedProperties).ConfigureAwait(false);
}
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
   _logger.LogInformation("Azure IoT Smart Edge Camera Service starting");

   try
   {
#if AZURE_IOT_HUB_CONNECTION
      _deviceClient = await AzureIoTHubConnection();
#endif
#if AZURE_IOT_HUB_DPS_CONNECTION
      _deviceClient = await AzureIoTHubDpsConnection();
#endif

#if AZURE_DEVICE_PROPERTIES
      _logger.LogTrace("ReportedProperties upload start");

      TwinCollection reportedProperties = new TwinCollection();

      reportedProperties["OSVersion"] = Environment.OSVersion.VersionString;
      reportedProperties["MachineName"] = Environment.MachineName;
      reportedProperties["ApplicationVersion"] = Assembly.GetAssembly(typeof(Program)).GetName().Version;
      reportedProperties["ImageTimerDue"] = _applicationSettings.ImageTimerDue;
      reportedProperties["ImageTimerPeriod"] = _applicationSettings.ImageTimerPeriod;
      reportedProperties["YoloV5ModelPath"] = _applicationSettings.YoloV5ModelPath;
      reportedProperties["PredictionScoreThreshold"] = _applicationSettings.PredictionScoreThreshold;
      reportedProperties["PredictionLabelsOfInterest"] = _applicationSettings.PredictionLabelsOfInterest;
      reportedProperties["PredictionLabelsMinimum"] = _applicationSettings.PredictionLabelsMinimum;

      await _deviceClient.UpdateReportedPropertiesAsync(reportedProperties, stoppingToken);

      _logger.LogTrace("ReportedProperties upload done");
#endif

      _logger.LogTrace("YoloV5 model setup start");
      _scorer = new YoloScorer&lt;YoloCocoP5Model&gt;(_applicationSettings.YoloV5ModelPath);
      _logger.LogTrace("YoloV5 model setup done");

      _ImageUpdatetimer = new Timer(ImageUpdateTimerCallback, null, _applicationSettings.ImageTimerDue, _applicationSettings.ImageTimerPeriod);

      await _deviceClient.SetMethodHandlerAsync("ImageTimerStart", ImageTimerStartHandler, null);
      await _deviceClient.SetMethodHandlerAsync("ImageTimerStop", ImageTimerStopHandler, null);
      await _deviceClient.SetMethodDefaultHandlerAsync(DefaultHandler, null);

      await _deviceClient.SetDesiredPropertyUpdateCallbackAsync(OnDesiredPropertyChangedAsync, null);

      try
      {
         await Task.Delay(Timeout.Infinite, stoppingToken);
      }
      catch (TaskCanceledException)
      {
         _logger.LogInformation("Application shutdown requested");
      }
   }
   catch (Exception ex)
   {
      _logger.LogError(ex, "Application startup failure");
   }
   finally
   {
      _deviceClient?.Dispose();
   }

   _logger.LogInformation("Azure IoT Smart Edge Camera Service shutdown");
}
// Lots of other code here
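The ImageTimerStart and ImageTimerStop direct method handlers are part of the elided code; a hypothetical sketch of what the start handler could look like (the actual implementation is not shown in this post).

private Task&lt;MethodResponse&gt; ImageTimerStartHandler(MethodRequest methodRequest, object userContext)
{
   _logger.LogInformation("ImageTimerStart handler");

   // Restart the timer with the due and period values from the application settings
   _ImageUpdatetimer.Change(_applicationSettings.ImageTimerDue, _applicationSettings.ImageTimerPeriod);

   return Task.FromResult(new MethodResponse(200));
}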
private async Task OnDesiredPropertyChangedAsync(TwinCollection desiredProperties, object userContext)
{
   TwinCollection reportedProperties = new TwinCollection();

   _logger.LogInformation("OnDesiredPropertyChanged handler");

   // NB - This approach does not persist ImageTimerDue or ImageTimerPeriod, so a stop/start will return to the appsettings.json
   // configuration values. If only one parameter is specified the other defaults to the appsettings.json value. If the timer
   // settings are changed I think they won't take effect until the next time the Timer fires.
   try
   {
      // Check to see if either of ImageTimerDue or ImageTimerPeriod has changed
      if (!desiredProperties.Contains("ImageTimerDue") && !desiredProperties.Contains("ImageTimerPeriod"))
      {
         _logger.LogInformation("OnDesiredPropertyChanged neither ImageTimerDue nor ImageTimerPeriod present");
         return;
      }

      TimeSpan imageTimerDue = _applicationSettings.ImageTimerDue;

      // Check that the format of ImageTimerDue is valid if present
      if (desiredProperties.Contains("ImageTimerDue"))
      {
         if (TimeSpan.TryParse(desiredProperties["ImageTimerDue"].Value, out imageTimerDue))
         {
            reportedProperties["ImageTimerDue"] = imageTimerDue;
         }
         else
         {
            _logger.LogInformation("OnDesiredPropertyChanged ImageTimerDue invalid");
            return;
         }
      }

      TimeSpan imageTimerPeriod = _applicationSettings.ImageTimerPeriod;

      // Check that the format of ImageTimerPeriod is valid if present
      if (desiredProperties.Contains("ImageTimerPeriod"))
      {
         if (TimeSpan.TryParse(desiredProperties["ImageTimerPeriod"].Value, out imageTimerPeriod))
         {
            reportedProperties["ImageTimerPeriod"] = imageTimerPeriod;
         }
         else
         {
            _logger.LogInformation("OnDesiredPropertyChanged ImageTimerPeriod invalid");
            return;
         }
      }

      _logger.LogInformation("Desired Due:{0} Period:{1}", imageTimerDue, imageTimerPeriod);

      if (!_ImageUpdatetimer.Change(imageTimerDue, imageTimerPeriod))
      {
         _logger.LogInformation("Desired Due:{0} Period:{1} failed", imageTimerDue, imageTimerPeriod);
      }

      await _deviceClient.UpdateReportedPropertiesAsync(reportedProperties);
   }
   catch (Exception ex)
   {
      _logger.LogError(ex, "OnDesiredPropertyChangedAsync handler failed");
   }
}
The TwinCollection desiredProperties is checked for the ImageTimerDue and ImageTimerPeriod properties and, if either is present and valid, the Timer.Change method is called.
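For clarity, the Timer.Change due value is the delay before the next callback and the period is the interval between subsequent callbacks; the values below are illustrative.

// Fire the first callback after 30 seconds, then every 5 minutes
_ImageUpdatetimer.Change(TimeSpan.FromSeconds(30), TimeSpan.FromMinutes(5));

// Timeout.InfiniteTimeSpan for both values stops the timer firing
_ImageUpdatetimer.Change(Timeout.InfiniteTimeSpan, Timeout.InfiniteTimeSpan);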
Azure IoT Central SmartEdgeCamera Device template capabilities
I added a View to the template so the two properties could be changed (I didn't configure either as required).
Azure IoT Central SmartEdgeCamera Device Default view designer
In the “Device Properties” “Operation” tab, when I changed the ImageTimerDue and/or ImageTimerPeriod there was visual feedback that an update was in progress.
Azure IoT Central SmartEdgeCamera Device Properties update start
This post builds on my Smartish Edge Camera – Azure IoT Direct Methods post adding a number of read only properties. In this version the application reports the OSVersion, MachineName, ApplicationVersion, ImageTimerDue, ImageTimerPeriod, YoloV5ModelPath, PredictionScoreThreshold, PredictionLabelsOfInterest, and PredictionLabelsMinimum.
Azure IoT Explorer displaying the reported “readonly” property values
Azure IoT Central Dashboard with readonly properties before UpdateReportedPropertiesAsync called
Azure IoT Central Telemetry displaying property update payloads
Azure IoT Central Dashboard displaying readonly properties
While testing the application I noticed the reported property version was increasing every time I deployed the application. I was retrieving the version information as the application started with AssemblyName.Version.
Visual Studio 2019 Application Package information
I had also configured the Assembly Version in the SmartEdgeCameraAzureIoTService project Package tab to update the assembly build number each time the application was compiled. This was forcing an update of the reported properties version every time the application started.
Azure IoT Central Template Direct Method configuration
Azure IoT Central Template Direct Method invocation
Azure Smart Edge Camera console application Start Direct Method call
Initially, I had one long post which covered Direct Methods, Readonly Properties and Updateable Properties but it got too long so I split it into three.