YoloV8-NuGet Performance X64 CPU

Posted on August 30, 2024 by devmobilenz

When checking the dme-compunet, YoloDotNet, and sstainba NuGets I noticed the YoloDotNet readme.md detailed some performance enhancements…

What’s new in YoloDotNet v2.0?

YoloDotNet 2.0 is a Speed Demon release where the main focus has been on supercharging performance to bring you the fastest and most efficient version yet. With major code optimizations, a switch to SkiaSharp for lightning-fast image processing, and added support for Yolov10 as a little extra 😉 this release is set to redefine your YoloDotNet experience:

Changing the implementation to use SkiaSharp caught my attention because in previous testing manipulating images with the SixLabors.ImageSharp library took longer than expected.

I built a test rig for comparing the performance of the different NuGets using standard images and ONNX Models.

I started with the dme-compunet YoloV8 NuGet which found all the tennis balls and the results were consistent with earlier tests.

dme-compunet test harness image bounding boxes

The YoloDotNet by NickSwardh NuGet update had some “breaking changes” so I built “old” and “updated” test harnesses. The V1 version found all the tennis balls and the results were consistent with earlier tests.

NickSwardh V1 test harness image bounding boxes

The V2 test harness needed some code changes because of the “breaking changes”, and the V2 results were slightly different from the V1 results.

NickSwardh V2 test harness image bounding boxes

Even though the YoloV8 by sstainba NuGet hadn’t been updated I ran the test harness just in case and the results were consistent with previous tests.

sstainba test harness image bounding boxes

The dme-compunet YoloV8 and NickSwardh YoloDotNet V1 versions produce the same results, but the NickSwardh YoloDotNet V2 results were slightly different. The YoloV8 by sstainba results were unchanged.

dme-compunet 71 mSec
NickSwardh V1 76 mSec
NickSwardh V2 33 mSec
sstainba 82 mSec

The NickSwardh V2 implementation was significantly faster, but I need to investigate the slight difference in the bounding boxes. It looks like SixLabors.ImageSharp might be the issue.
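To put the timings above in perspective, the relative speed of each NuGet can be worked out from the average durations. This is a minimal sketch that uses only the four numbers reported in this post; the SpeedComparison class and its method names are hypothetical and are not part of the test rig.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SpeedComparison
{
    // Average durations (mSec) reported in this post.
    public static readonly IReadOnlyDictionary<string, double> AverageMilliseconds =
        new Dictionary<string, double>
        {
            ["dme-compunet"] = 71,
            ["NickSwardh V1"] = 76,
            ["NickSwardh V2"] = 33,
            ["sstainba"] = 82,
        };

    // How many times slower the named entry is than the fastest entry.
    public static double SlowdownVersusFastest(string name)
    {
        double fastest = AverageMilliseconds.Values.Min();

        return AverageMilliseconds[name] / fastest;
    }

    // Lists the entries from fastest to slowest with their slowdown factor.
    public static void PrintSummary()
    {
        foreach (var pair in AverageMilliseconds.OrderBy(p => p.Value))
        {
            Console.WriteLine($"{pair.Key,-14} {pair.Value,4:f0} mSec  {SlowdownVersusFastest(pair.Key):f2}x");
        }
    }
}
```

On these numbers the other three NuGets take a little over twice as long as YoloDotNet V2 (71 / 33 ≈ 2.15, 76 / 33 ≈ 2.30, 82 / 33 ≈ 2.48).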

YoloV8 ONNX – Nvidia Jetson Orin Nano™ Execution Providers

Posted on July 15, 2024 by devmobilenz

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G, which can be used for inferencing with the Open Neural Network Exchange (ONNX) Runtime.

Story of Fail

Inferencing worked first time on the ARM64 CPU because the required runtime is included in the Microsoft.ML.OnnxRuntime NuGet.

Microsoft.ML.OnnxRuntime NuGet ARM64 Linux runtime

Inferencing failed on the Nvidia Jetson Orin 8G because the CUDA Execution Provider and TensorRT Execution Provider for the ONNXRuntime were not included in the Microsoft.ML.OnnxRuntime.GPU.Linux NuGet.

There were Linux x64 and Windows x64 versions of the ONNXRuntime library included in the Microsoft.ML.OnnxRuntime.Gpu NuGet.

Microsoft.ML.OnnxRuntime.Gpu NuGet x64 Linux runtime

Desperately Seeking libonnxruntime.so

The Nvidia ONNX runtime site had pip wheel files for the different versions of Python and the Open Neural Network Exchange (ONNX) Runtime.

The onnxruntime_gpu-1.18.0-cp312-cp312-linux_aarch64.whl matched the version of the ONNXRuntime I needed and the version of Python on the device.

When the pip wheel file was renamed onnxruntime_gpu-1.18.0-cp312-cp312-linux_aarch64.zip it could be opened, but there wasn’t a libonnxruntime.so.

Onnxruntime_gpu-1.18.0-cp312-cp312-linux_aarch64 file listing
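A pip wheel is a zip archive, so the same check can be scripted without renaming the file. The following is a minimal sketch using System.IO.Compression; the WheelInspector class and its method names are hypothetical and this is not how I did the original search.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class WheelInspector
{
    // Returns the full names of the archive entries whose file name contains the search text.
    public static List<string> FindEntries(Stream wheelStream, string searchText)
    {
        using var archive = new ZipArchive(wheelStream, ZipArchiveMode.Read, leaveOpen: true);

        return archive.Entries
            .Where(entry => entry.Name.Contains(searchText, StringComparison.OrdinalIgnoreCase))
            .Select(entry => entry.FullName)
            .ToList();
    }

    // Convenience overload which opens the wheel file from disk.
    public static List<string> FindEntries(string wheelPath, string searchText)
    {
        using var stream = File.OpenRead(wheelPath);

        return FindEntries(stream, searchText);
    }
}
```

Calling WheelInspector.FindEntries("onnxruntime_gpu-1.18.0-cp312-cp312-linux_aarch64.whl", "libonnxruntime.so") and getting an empty list back would match what the file listing showed.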

Building the TensorRT & CUDA Execution Providers

The ONNXRuntime build has to be done on the Nvidia Jetson Orin. Even after installing all the necessary prerequisites, the first attempt failed.

bryn@ubuntu:~/onnxruntime/onnxruntime$ ./build.sh --config Release --update --build --build_wheel \
--use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
--tensorrt_home /usr/lib/aarch64-linux-gnu

In high power mode more cores are used, but building the ONNXRuntime then consumes more resources. The compile process was having “out of memory” failures, so --parallel 2 was added to the command line to limit resource utilisation.

bryn@ubuntu:~/onnxruntime/onnxruntime$ ./build.sh --config Release --update --build --parallel 2 --build_wheel \
--use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
--tensorrt_home /usr/lib/aarch64-linux-gnu

There were some compiler warnings but they appear to be benign.

The first attempt at running the application failed because libonnxruntime.so was missing, so --build_shared_lib was added to the command line.

2024-06-10 18:21:58,480 build [INFO] - Build complete
bryn@ubuntu:~/onnxruntime/onnxruntime$ ./build.sh --config Release --update --build --parallel 2 --build_wheel --use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu --tensorrt_home /usr/lib/aarch64-linux-gnu --build_shared_lib

When the build completed the files were copied to the runtime folder of the program.

The application could then be configured to use the TensorRT Execution Provider.

Getting CUDA and TensorRT working on the Nvidia Jetson Orin 8G took much longer than I expected, with many dead ends and device factory resets before the process was repeatable.

YoloV8 ONNX – Nvidia Jetson Orin Nano™ CPU & GPU TensorRT Inferencing

Posted on July 15, 2024 by devmobilenz

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G. To speed up TensorRT inferencing I built an Open Neural Network Exchange (ONNX) TensorRT Execution Provider. After updating the code to add a “warm-up” and tracking of average pre-processing, inferencing & post-processing durations I did a series of CPU & GPU performance tests.
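The warm-up and averaging can be sketched with a Stopwatch. This is a minimal illustration, not the code from my test application; the StageTimer class, its members and the three Action parameters are hypothetical stand-ins for the real pre-processing, inferencing and post-processing steps.

```csharp
using System;
using System.Diagnostics;

public sealed class StageTimer
{
    private long _preprocessTicks;
    private long _inferenceTicks;
    private long _postprocessTicks;

    public int Iterations { get; private set; }

    // Run every stage once without recording anything, so one-off start-up costs
    // (JIT compilation, engine loading, buffer allocation) stay out of the averages.
    public static void WarmUp(Action preprocess, Action inference, Action postprocess)
    {
        preprocess();
        inference();
        postprocess();
    }

    // Run and time each stage, accumulating the elapsed ticks per stage.
    public void Measure(Action preprocess, Action inference, Action postprocess)
    {
        _preprocessTicks += Time(preprocess);
        _inferenceTicks += Time(inference);
        _postprocessTicks += Time(postprocess);
        Iterations++;
    }

    public double AveragePreprocessMs => Average(_preprocessTicks);
    public double AverageInferenceMs => Average(_inferenceTicks);
    public double AveragePostprocessMs => Average(_postprocessTicks);

    private static long Time(Action stage)
    {
        long start = Stopwatch.GetTimestamp();
        stage();
        return Stopwatch.GetTimestamp() - start;
    }

    // Converts accumulated Stopwatch ticks to an average in milliseconds.
    private double Average(long ticks) =>
        Iterations == 0 ? 0.0 : ticks * 1000.0 / Stopwatch.Frequency / Iterations;
}
```

The idea is to call WarmUp once, then Measure in a loop, and only report the averages at the end.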

The testing consisted of permutations of three models: TennisBallsYoloV8s20240618640×640.onnx, TennisBallsYoloV8s2024062410241024.onnx & TennisBallsYoloV8x20240614640×640 (limited testing as slow), and three images: TennisBallsLandscape640x640.jpg, TennisBallsLandscape1024x1024.jpg & TennisBallsLandscape3072x4080.jpg.

Executive Summary

As expected, inferencing with a TensorRT 640×640 model and a 640×640 image was fastest, 9mSec pre-processing, 21mSec inferencing, then 4mSec post-processing.
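As a rough worked example, the three stage timings above can be added up to get a per-image latency and an upper bound on throughput. The LatencyBudget class is hypothetical, and the frames-per-second figure ignores image download and any other I/O.

```csharp
using System;

public static class LatencyBudget
{
    // Total time to process one image, in milliseconds.
    public static double TotalMilliseconds(double preprocessMs, double inferenceMs, double postprocessMs) =>
        preprocessMs + inferenceMs + postprocessMs;

    // Best-case images per second if the stages run back to back with no other overhead.
    public static double FramesPerSecond(double totalMs) => 1000.0 / totalMs;
}
```

For the fastest case that is 9 + 21 + 4 = 34 mSec per image, or at best roughly 29 images per second.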

If the image had to be scaled with SixLabors.ImageSharp this significantly increased the preprocessing (and overall) time.

CPU Inferencing

GPU TensorRT Small model Inferencing

GPU TensorRT Large model Inferencing

YoloV8 ONNX – Nvidia Jetson Orin Nano™ DenseTensor Performance

Posted on June 24, 2024 by devmobilenz

When running the YoloV8 Coprocessor demonstration on the Nvidia Jetson Orin inferencing looked a bit odd; the dotted line wasn’t moving as fast as expected. To investigate this further I split the inferencing duration into pre-processing, inferencing and post-processing times. Inferencing and post-processing were “quick”, but pre-processing was taking longer than expected.

YoloV8 Coprocessor application running on Nvidia Jetson Orin

When I ran the demonstration Ultralytics YoloV8 object detection console application on my development desktop (13th Gen Intel(R) Core(TM) i7-13700 2.10 GHz with 32.0 GB) the pre-processing was much faster.

The much shorter pre-processing and longer inferencing durations were not a surprise as my development desktop does not have a Graphics Processing Unit (GPU).

Test image used for testing on Jetson device and development PC

The test image taken with my mobile was 3606×2715 pixels, which was representative of the security camera images to be processed by the solution.

Redgate ANTS Performance Profiler instrumentation of application execution

On my development box, running the application with Redgate ANTS Performance Profiler highlighted that the Compunet YoloV8 code converting the image to a DenseTensor could be an issue.

 public static void ProcessToTensor(Image<Rgb24> image, Size modelSize, bool originalAspectRatio, DenseTensor<float> target, int batch)
 {
    var options = new ResizeOptions()
    {
       Size = modelSize,
       Mode = originalAspectRatio ? ResizeMode.Max : ResizeMode.Stretch,
    };

    var xPadding = (modelSize.Width - image.Width) / 2;
    var yPadding = (modelSize.Height - image.Height) / 2;

    var width = image.Width;
    var height = image.Height;

    // Pre-calculate strides for performance
    var strideBatchR = target.Strides[0] * batch + target.Strides[1] * 0;
    var strideBatchG = target.Strides[0] * batch + target.Strides[1] * 1;
    var strideBatchB = target.Strides[0] * batch + target.Strides[1] * 2;
    var strideY = target.Strides[2];
    var strideX = target.Strides[3];

    // Get a span of the whole tensor for fast access
    var tensorSpan = target.Buffer;

    // Try get continuous memory block of the entire image data
    if (image.DangerousTryGetSinglePixelMemory(out var memory))
    {
       Parallel.For(0, width * height, index =>
       {
             int x = index % width;
             int y = index / width;
             int tensorIndex = strideBatchR + strideY * (y + yPadding) + strideX * (x + xPadding);

             var pixel = memory.Span[index];
             WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
       });
    }
    else
    {
       Parallel.For(0, height, y =>
       {
             var rowSpan = image.DangerousGetPixelRowMemory(y).Span;
             int tensorYIndex = strideBatchR + strideY * (y + yPadding);

             for (int x = 0; x < width; x++)
             {
                int tensorIndex = tensorYIndex + strideX * (x + xPadding);
                var pixel = rowSpan[x];
                WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
             }
       });
    }
 }

 private static void WritePixel(Span<float> tensorSpan, int tensorIndex, Rgb24 pixel, int strideBatchR, int strideBatchG, int strideBatchB)
 {
    tensorSpan[tensorIndex] = pixel.R / 255f;
    tensorSpan[tensorIndex + strideBatchG - strideBatchR] = pixel.G / 255f;
    tensorSpan[tensorIndex + strideBatchB - strideBatchR] = pixel.B / 255f;
 }

For a 3606×2715 image the WritePixel method would be called tens of millions of times so its implementation and the overall approach used for ProcessToTensor has a significant impact on performance.

YoloV8 Coprocessor application running on Nvidia Jetson Orin with a resized image

Resizing the images had a significant impact on performance on both the development box and the Nvidia Jetson Orin. This will need some investigation to see how much resizing the images affects the performance and accuracy of the model.
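To get a feel for how much work resizing removes, this sketch calculates the size an image ends up when it is scaled to fit the model input while keeping its aspect ratio, plus the padding either side. The LetterboxCalculator class is hypothetical, and the rounding is an assumption; SixLabors.ImageSharp may round slightly differently.

```csharp
using System;

public static class LetterboxCalculator
{
    // Scales the image to fit inside the model input while preserving the aspect ratio,
    // then returns the scaled size and the padding needed to centre it.
    public static (int Width, int Height, int XPadding, int YPadding) Fit(
        int imageWidth, int imageHeight, int modelWidth, int modelHeight)
    {
        double scale = Math.Min((double)modelWidth / imageWidth, (double)modelHeight / imageHeight);

        int width = Math.Min(modelWidth, (int)Math.Round(imageWidth * scale));
        int height = Math.Min(modelHeight, (int)Math.Round(imageHeight * scale));

        return (width, height, (modelWidth - width) / 2, (modelHeight - height) / 2);
    }
}
```

For the 3606×2715 test image and a 640×640 model this gives 640×482 with 79 pixels of padding top and bottom, so the number of pixels to convert drops from 9,790,290 to 308,480.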

The ProcessToTensor method has already had some performance optimisations which improved performance by roughly 20%. There have been discussions about optimising similar code e.g. Efficient Bitmap to OnnxRuntime Tensor in C#, and Efficient RGB Image to Tensor in dotnet which look applicable and these will be evaluated.
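As a baseline for that evaluation, the conversion can be written as one sequential pass over a contiguous pixel buffer. This sketch is not the code from either discussion or from the Compunet library; it assumes the image is already resized, the pixels are interleaved R,G,B bytes, there is no padding and the batch size is one. The PlanarTensorWriter class is hypothetical.

```csharp
using System;

public static class PlanarTensorWriter
{
    // Converts interleaved RGB bytes (R,G,B,R,G,B...) into planar floats
    // (all R values, then all G, then all B) scaled to the range 0..1.
    public static float[] ToPlanar(ReadOnlySpan<byte> rgb, int width, int height)
    {
        int plane = width * height;

        if (rgb.Length != plane * 3)
        {
            throw new ArgumentException("Buffer length must be width * height * 3.", nameof(rgb));
        }

        var target = new float[plane * 3];

        for (int i = 0, source = 0; i < plane; i++, source += 3)
        {
            target[i] = rgb[source] / 255f;                 // red plane
            target[plane + i] = rgb[source + 1] / 255f;     // green plane
            target[2 * plane + i] = rgb[source + 2] / 255f; // blue plane
        }

        return target;
    }
}
```

Whether a simple loop like this beats the Parallel.For version will depend on the image size and the number of cores, so it would need to be measured on both devices.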

YoloV8 ONNX – Nvidia Jetson Orin Nano™ GPU TensorRT Inferencing

Posted on June 17, 2024 by devmobilenz

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G. To speed up inferencing on the Nvidia Jetson Orin 8G with TensorRT I built an Open Neural Network Exchange (ONNX) TensorRT Execution Provider.

Roboflow Universe Tennis Ball by Ugur ozdemir dataset

The Open Neural Network Exchange (ONNX) model used was trained on the Roboflow Universe Tennis Ball by Ugur ozdemir dataset, which has 23696 images. The initial version of the TensorRT integration used the builder.UseTensorrt method of the IYoloV8Builder interface.

...
YoloV8Builder builder = new YoloV8Builder();

builder.UseOnnxModel(_applicationSettings.ModelPath);

if (_applicationSettings.UseTensorrt)
{
   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Using TensorRT");

   builder.UseTensorrt(_applicationSettings.DeviceId);
}
...

When the YoloV8.Coprocessor.Detect.Image application was configured to use the NVIDIA TensorRT Execution provider the average inference time was 58mSec but it took roughly 7 minutes to build and optimise the engine each time the application was run.

Generating the TensorRT engine every time the application is started

The TensorRT Execution provider has a number of configuration options, but to allow additional configuration settings the IYoloV8Builder interface had to be modified, with UseCuda, UseRocm, UseTensorrt and UseTvm overloads implemented.

...
public class YoloV8Builder : IYoloV8Builder
{
...
    public IYoloV8Builder UseOnnxModel(BinarySelector model)
    {
        _model = model;

        return this;
    }

#if GPURELEASE
    public IYoloV8Builder UseCuda(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithCudaProvider(deviceId));

    public IYoloV8Builder UseCuda(OrtCUDAProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithCudaProvider(options));

    public IYoloV8Builder UseRocm(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithRocmProvider(deviceId));
    
    // Couldn't test this don't have suitable hardware
    public IYoloV8Builder UseRocm(OrtROCMProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithRocmProvider(options));

    public IYoloV8Builder UseTensorrt(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithTensorrtProvider(deviceId));

    public IYoloV8Builder UseTensorrt(OrtTensorRTProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithTensorrtProvider(options));

    // Couldn't test this don't have suitable hardware
    public IYoloV8Builder UseTvm(string settings = "") => WithSessionOptions(SessionOptions.MakeSessionOptionWithTvmProvider(settings));
#endif
...
}

The trt_engine_cache_enable and trt_engine_cache_path TensorRT Execution provider session options configured the engine to be cached when it’s built for the first time so when a new inference session is created the engine can be loaded directly from disk.

...
YoloV8Builder builder = new YoloV8Builder();

builder.UseOnnxModel(_applicationSettings.ModelPath);

if (_applicationSettings.UseTensorrt)
{
   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Using TensorRT");

   OrtTensorRTProviderOptions tensorRToptions = new OrtTensorRTProviderOptions();

   Dictionary<string, string> optionKeyValuePairs = new Dictionary<string, string>();

   optionKeyValuePairs.Add("trt_engine_cache_enable", "1");
   optionKeyValuePairs.Add("trt_engine_cache_path", "enginecache/");

   tensorRToptions.UpdateOptions(optionKeyValuePairs);

   builder.UseTensorrt(tensorRToptions);
}
...

In order to validate that the engine loaded from the trt_engine_cache_path is usable for the current inference, an engine profile is also cached and loaded along with the engine.

If current input shapes are in the range of the engine profile, the loaded engine can be safely used. If input shapes are out of range, the profile will be updated and the engine will be recreated based on the new profile.
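The two cache settings can be wrapped in a small helper so the cache folder is in place before the session is created. This is a sketch only: the TensorRtCacheSettings class is hypothetical, and creating the folder up front is an assumption on my part as I have not checked whether the TensorRT Execution provider creates it itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TensorRtCacheSettings
{
    // Builds the key/value pairs for engine caching and makes sure the cache folder exists.
    public static Dictionary<string, string> Build(string cachePath)
    {
        // Assumption: creating the folder is harmless if the provider would have created it anyway.
        Directory.CreateDirectory(cachePath);

        return new Dictionary<string, string>
        {
            ["trt_engine_cache_enable"] = "1",
            ["trt_engine_cache_path"] = cachePath,
        };
    }
}
```

The result could then be passed to tensorRToptions.UpdateOptions in place of the hand-built optionKeyValuePairs dictionary.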

Reusing the TensorRT engine built the first time the application is started

When the YoloV8.Coprocessor.Detect.Image application was configured to use NVIDIA TensorRT and the engine was cached the average inference time was 58mSec and the Build method took roughly 10sec to execute after the application had been run once.

The trtexec utility can “pre-generate” engines but there doesn’t appear to be a way to use them with the TensorRT Execution provider.

YoloV8 ONNX – Nvidia Jetson Orin Nano™ GPU CUDA Inferencing

Posted on June 13, 2024 by devmobilenz

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G. To speed up inferencing on the Nvidia Jetson Orin 8G with Compute Unified Device Architecture (CUDA) I built an Open Neural Network Exchange (ONNX) CUDA Execution Provider.

The Open Neural Network Exchange (ONNX) model used was trained on the Roboflow Universe Tennis Ball by Ugur ozdemir dataset, which has 23696 images.

// load the app settings into configuration
var configuration = new ConfigurationBuilder()
      .AddJsonFile("appsettings.json", false, true)
      .Build();

_applicationSettings = configuration.GetSection("ApplicationSettings").Get<Model.ApplicationSettings>();

Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} YoloV8 Model load: {_applicationSettings.ModelPath}");

YoloV8Builder builder = new YoloV8Builder();

builder.UseOnnxModel(_applicationSettings.ModelPath);

if (_applicationSettings.UseCuda)
{
   builder.UseCuda(_applicationSettings.DeviceId);
}

if (_applicationSettings.UseTensorrt)
{
   builder.UseTensorrt(_applicationSettings.DeviceId);
}

/*
builder.WithConfiguration(c =>
{
});
*/

/*
builder.WithSessionOptions(new Microsoft.ML.OnnxRuntime.SessionOptions()
{

});
*/

using (var image = await SixLabors.ImageSharp.Image.LoadAsync<Rgba32>(_applicationSettings.ImageInputPath))
using (var predictor = builder.Build())
{
   var result = await predictor.DetectAsync(image);

   Console.WriteLine();
   Console.WriteLine($"Speed: {result.Speed}");
   Console.WriteLine();

   foreach (var prediction in result.Boxes)
   {
      Console.WriteLine($" Class {prediction.Class} {(prediction.Confidence * 100.0):f1}% X:{prediction.Bounds.X} Y:{prediction.Bounds.Y} Width:{prediction.Bounds.Width} Height:{prediction.Bounds.Height}");
   }

   Console.WriteLine();

   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Plot and save : {_applicationSettings.ImageOutputPath}");

   using (var imageOutput = await result.PlotImageAsync(image))
   {
      await imageOutput.SaveAsJpegAsync(_applicationSettings.ImageOutputPath);
   }
}

When configured to run the YoloV8.Coprocessor.Detect.Image on the ARM64 CPU the average inference time was 729 mSec.

The first time I ran the YoloV8.Coprocessor.Detect.Image application configured to use CUDA for inferencing it failed badly.

The YoloV8.Coprocessor.Detect.Image application was then configured to use CUDA and the average inferencing time was 85mSec.

It took a couple of weeks to get the YoloV8.Coprocessor.Detect.Image application inferencing on the Nvidia Jetson Orin 8G coprocessor and this will be covered in detail in another post.

Azure Event Grid YoloV8- Basic MQTT Client Pose Estimation

Posted on May 23, 2024 by devmobilenz

The Azure.EventGrid.Image.YoloV8.Pose application downloads images from a security camera, processes them with the default YoloV8(by Ultralytics) Pose Estimation model then publishes the results to an Azure Event Grid MQTT broker topic.

private async void ImageUpdateTimerCallback(object? state)
{
   DateTime requestAtUtc = DateTime.UtcNow;

   // Just incase - stop code being called while photo or prediction already in progress
   if (_ImageProcessing)
   {
      return;
   }
   _ImageProcessing = true;

   try
   {
      _logger.LogDebug("Camera request start");

      PoseResult result;

      using (Stream cameraStream = await _httpClient.GetStreamAsync(_applicationSettings.CameraUrl))
      {
         result = await _predictor.PoseAsync(cameraStream);
      }

      _logger.LogInformation("Speed Preprocess:{Preprocess} Postprocess:{Postprocess}", result.Speed.Preprocess, result.Speed.Postprocess);


      if (_logger.IsEnabled(LogLevel.Debug))
      {
         _logger.LogDebug("Pose results");

         foreach (var box in result.Boxes)
         {
            _logger.LogDebug(" Class:{box.Class} Confidence:{Confidence:f1}% X:{X} Y:{Y} Width:{Width} Height:{Height}", box.Class.Name, box.Confidence * 100.0, box.Bounds.X, box.Bounds.Y, box.Bounds.Width, box.Bounds.Height);

            foreach (var keypoint in box.Keypoints)
            {
               Model.PoseMarker poseMarker = (Model.PoseMarker)keypoint.Index;

               _logger.LogDebug("  Class:{Class} Confidence:{Confidence:f1}% X:{X} Y:{Y}", Enum.GetName(poseMarker), keypoint.Confidence * 100.0, keypoint.Point.X, keypoint.Point.Y);
            }
         }
      }

      var message = new MQTT5PublishMessage
      {
         Topic = string.Format(_applicationSettings.PublishTopic, _applicationSettings.UserName),
         Payload = Encoding.ASCII.GetBytes(JsonSerializer.Serialize(new
         {
            result.Boxes
         })),
         QoS = _applicationSettings.PublishQualityOfService,
      };

      _logger.LogDebug("HiveMQ.Publish start");

      var resultPublish = await _mqttclient.PublishAsync(message);

      _logger.LogDebug("HiveMQ.Publish done");
   }
   catch (Exception ex)
   {
      _logger.LogError(ex, "Camera image download, processing, or telemetry failed");
   }
   finally
   {
      _ImageProcessing = false;
   }

   TimeSpan duration = DateTime.UtcNow - requestAtUtc;

   _logger.LogDebug("Camera Image download, processing and telemetry done {TotalSeconds:f2} sec", duration.TotalSeconds);
}

The application uses a Timer (with configurable Due and Period times) to poll the security camera, run pose estimation on the image, then publish a JavaScript Object Notation (JSON) representation of the results to an Azure Event Grid MQTT broker topic using a HiveMQ client.
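The callback above uses a plain bool to stop a new request starting while one is still in progress. Assuming the Timer is a System.Threading.Timer, callbacks run on thread pool threads, so in principle two callbacks could both see the flag as false. A sketch of an atomic alternative is below; the NonReentrantGuard class is hypothetical and is not used in the sample application.

```csharp
using System;
using System.Threading;

public sealed class NonReentrantGuard
{
    private int _busy; // 0 = idle, 1 = a callback is in progress

    // Atomically claims the guard. Returns true if the caller now owns it,
    // false if another callback is still running.
    public bool TryEnter() => Interlocked.CompareExchange(ref _busy, 1, 0) == 0;

    // Releases the guard so the next callback can run.
    public void Exit() => Volatile.Write(ref _busy, 0);
}
```

In the timer callback this would replace the flag check with "if (!guard.TryEnter()) return;" and the assignment in the finally block with "guard.Exit();".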

The Unv ADZK-10 camera used in this sample has a Hypertext Transfer Protocol (HTTP) Uniform Resource Locator (URL) for downloading the current image. Like the YoloV8.Detect.SecurityCamera.Stream sample the image is “streamed” using HttpClient.GetStreamAsync to the YoloV8 PoseAsync method.

Azure.EventGrid.Image.YoloV8.Pose application console output

The same approach as the YoloV8.Detect.SecurityCamera.Stream sample is used because the image doesn’t have to be saved on the local filesystem.

To check the results, I put a breakpoint in the timer just after PoseAsync method is called and then used the Visual Studio 2022 Debugger QuickWatch functionality to inspect the contents of the PoseResult object.

Visual Studio 2022 Debugger PoseResult Quickwatch

For testing I configured a single Azure Event Grid custom topic subscription to an Azure Storage Queue.

An Azure Storage Queue is an easy way to store messages while debugging/testing an application.

Azure Storage Explorer is a good tool for listing recent messages, then inspecting their payloads.

The Azure Event Grid custom topic message text(in data_base64) contains the JavaScript Object Notation(JSON) of the pose detection result.

{"Boxes":[{"Keypoints":[{"Index":0,"Point":{"X":744,"Y":58,"IsEmpty":false},"Confidence":0.6334442},{"Index":1,"Point":{"X":746,"Y":33,"IsEmpty":false},"Confidence":0.759928},{"Index":2,"Point":{"X":739,"Y":46,"IsEmpty":false},"Confidence":0.19036674},{"Index":3,"Point":{"X":784,"Y":8,"IsEmpty":false},"Confidence":0.8745915},{"Index":4,"Point":{"X":766,"Y":45,"IsEmpty":false},"Confidence":0.086735755},{"Index":5,"Point":{"X":852,"Y":50,"IsEmpty":false},"Confidence":0.9166329},{"Index":6,"Point":{"X":837,"Y":121,"IsEmpty":false},"Confidence":0.85815763},{"Index":7,"Point":{"X":888,"Y":31,"IsEmpty":false},"Confidence":0.6234426},{"Index":8,"Point":{"X":871,"Y":205,"IsEmpty":false},"Confidence":0.37670398},{"Index":9,"Point":{"X":799,"Y":21,"IsEmpty":false},"Confidence":0.3686208},{"Index":10,"Point":{"X":768,"Y":205,"IsEmpty":false},"Confidence":0.21734264},{"Index":11,"Point":{"X":912,"Y":364,"IsEmpty":false},"Confidence":0.98523325},{"Index":12,"Point":{"X":896,"Y":382,"IsEmpty":false},"Confidence":0.98377174},{"Index":13,"Point":{"X":888,"Y":637,"IsEmpty":false},"Confidence":0.985927},{"Index":14,"Point":{"X":849,"Y":645,"IsEmpty":false},"Confidence":0.9834709},{"Index":15,"Point":{"X":951,"Y":909,"IsEmpty":false},"Confidence":0.96191007},{"Index":16,"Point":{"X":921,"Y":894,"IsEmpty":false},"Confidence":0.9618156}],"Class":{"Id":0,"Name":"person"},"Bounds":{"X":690,"Y":3,"Width":315,"Height":1001,"Location":{"X":690,"Y":3,"IsEmpty":false},"Size":{"Width":315,"Height":1001,"IsEmpty":false},"IsEmpty":false,"Top":3,"Right":1005,"Bottom":1004,"Left":690},"Confidence":0.8341071}]}
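Reading a queued message back can be sketched with System.Text.Json. This assumes the data_base64 value has already been extracted from the message and decodes directly to the JSON shown above, and that the keypoint Index values follow the COCO keypoint order used by the default YoloV8 pose model. The PosePayloadReader class is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.Json;

public static class PosePayloadReader
{
    // COCO keypoint order (assumed to match the Index values in the payload).
    private static readonly string[] KeypointNames =
    {
        "Nose", "LeftEye", "RightEye", "LeftEar", "RightEar",
        "LeftShoulder", "RightShoulder", "LeftElbow", "RightElbow",
        "LeftWrist", "RightWrist", "LeftHip", "RightHip",
        "LeftKnee", "RightKnee", "LeftAnkle", "RightAnkle",
    };

    // Decodes the base64 payload and returns the names of the keypoints
    // whose confidence is at or above the threshold, for every box.
    public static List<string> ConfidentKeypoints(string dataBase64, double minimumConfidence)
    {
        string json = Encoding.UTF8.GetString(Convert.FromBase64String(dataBase64));
        var names = new List<string>();

        using JsonDocument document = JsonDocument.Parse(json);

        foreach (JsonElement box in document.RootElement.GetProperty("Boxes").EnumerateArray())
        {
            foreach (JsonElement keypoint in box.GetProperty("Keypoints").EnumerateArray())
            {
                int index = keypoint.GetProperty("Index").GetInt32();
                double confidence = keypoint.GetProperty("Confidence").GetDouble();

                if (confidence >= minimumConfidence && index >= 0 && index < KeypointNames.Length)
                {
                    names.Add(KeypointNames[index]);
                }
            }
        }

        return names;
    }
}
```

Filtering on confidence is useful because low-confidence keypoints, like indexes 2 and 4 in the message above, are often not visible in the image.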

YoloV8 ONNX – Nvidia Jetson Orin Nano™ ARM64 CPU Inferencing

Posted on May 19, 2024 by devmobilenz

I configured the demonstration Ultralytics YoloV8 object detection (yolov8s.onnx) console application to process a 1920×1080 image from a security camera on my desktop development box (13th Gen Intel(R) Core(TM) i7-13700 2.10 GHz with 32.0 GB).

Object Detection sample application running on my development box

A Seeedstudio reComputer J3011 uses an Nvidia Jetson Orin 8G and looked like a cost-effective platform to explore how a dedicated Artificial Intelligence (AI) co-processor could reduce inferencing times.

To establish a “baseline” I “published” the demonstration application on my development box which created a folder with all the files required to run the application on the Seeedstudio reComputer J3011 ARM64 CPU. I had to manually merge the “User Secrets” and appsettings.json files so the camera connection configuration was correct.

The runtimes folder contained a number of folders with the native runtime files for the supported Open Neural Network Exchange (ONNX) platforms.

Object Detection application publish runtimes folder

The Nvidia Jetson Orin ARM64 CPU requires the linux-arm64 ONNX runtime, which was “automagically” detected (in previous versions of ML.Net the native runtime had to be copied to the execution directory).
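To see which folder under runtimes should be picked up on a device, the runtime identifier (RID) can be printed and combined with the NuGet runtimes/<rid>/native layout. This is a minimal sketch; the NativeRuntimeLocator class is hypothetical, and on some .NET versions the reported RID is distro-specific rather than the portable linux-arm64.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeRuntimeLocator
{
    // NuGet convention: native assets live under runtimes/<rid>/native/.
    public static string NativeLibraryPath(string runtimeIdentifier, string libraryFileName) =>
        $"runtimes/{runtimeIdentifier}/native/{libraryFileName}";

    // Describes the platform the application is currently running on.
    public static string Describe() =>
        $"RID:{RuntimeInformation.RuntimeIdentifier} Architecture:{RuntimeInformation.ProcessArchitecture}";
}
```

For the reComputer J3011 ARM64 CPU the path of interest would be NativeLibraryPath("linux-arm64", "libonnxruntime.so").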

The final step was to use the demonstration Ultralytics YoloV8 object detection (yolov8s.onnx) console application to process a 1920×1080 image from a security camera on the reComputer J3011 (6-core Arm® Cortex® 64-bit CPU 1.5GHz processor).

Object Detection sample application running on my Seeedstudio reComputer J3011

When I averaged the pre-processing, inferencing and post-processing times for both devices over 20 executions my development box was much faster, which was not a surprise, though the reComputer J3011 post-processing times were a bit faster than I was expecting.

ARM64 CPU Preprocess 0.05s Inference 0.31s Postprocess 0.05s

YoloV8-Training a model with Ultralytics Hub

Posted on April 21, 2024 by devmobilenz

After uploading the roboflow Tennis Ball dataset from my previous post to an Ultralytics Hub dataset, I used my Ultralytics Pro plan to train a proof of concept (PoC) YoloV8 model.

Selecting the training type and the dataset to upload

Confirming the number of classes and splits of the training dataset

Selecting the output model architecture (YoloV8s).

Configuring the number of epochs and payment method

Preparing the cloud instance(s) for training

The training process completed with some basic model metrics.

The resources used and model accuracy metrics.

Testing the trained model inference results with my test image.

Exporting the trained YoloV8 model in ONNX format.

The duration and cost of training the model.

Testing the YoloV8 model with the dem-compunet.Image console application

Marked-up image generated by the dem-compunet.Image console application.

In this post I have not covered YoloV8 model selection and tuning of the training configuration to optimise the “performance” of the model. I used the default settings and then ran the model training overnight, which cost USD6.77.

This post is not about how to create a “good” model; it is the approach I took to create a “proof of concept” model for a demonstration.

YoloV8-Selecting a roboflow dataset

Posted on April 20, 2024 by devmobilenz

To comply with the Ultralytics AGPL-3.0 License and to use an Ultralytics Pro plan the source code and models for an application have to be open source. Rather than publishing my YoloV8 model (which is quite large), this is the first in a series of posts which detail the process I used to create it (which I think is more useful).

The single test image (not a good idea) is a photograph of 30 tennis balls on my living room floor.

Test image of 30 tennis balls on my living room floor

I started with the “default” yolov8s.onnx model which is included in the YoloV8 NuGet package GitHub repository YoloV8.Demo application.

YoloV8s.Onnx Tennis ball object detection results

The object detection results using the “default” model were pretty bad, but this wasn’t a surprise as the model is not optimised for this sort of problem.

Roboflow has a suite of tools for annotating, automatic labelling, training and deployment of models as well as a roboflow universe which (according to their website) is “The largest resource of computer vision datasets and pre-trained models”.

roboflow universe open-source model dataset search

I have used datasets from roboflow universe which is a great resource for building “proof of concept” applications.

The first step was to identify some datasets which would improve my tennis ball object detection model results. After some searching (with tennis, tennis-ball etc. classes) and filtering (object detection, has a model for faster evaluation, more than 5000 images) to reduce the search results to a manageable number, I identified 5 datasets worth further evaluation.

In my scenario the performance of the Acebot by Mrunal model was worse than the “default” yolov8s model.

In my scenario the performance of the tennis racket by test model was similar to the “default” yolov8s model.

In my scenario the performance of the Tennis Ball by Hust model was a bit better than the “default” yolov8s model.

In my scenario the performance of the roboflow_oball by ahmedelshalkany model was pretty good; it detected 28 of the 30 tennis balls.

In my scenario the performance of the Tennis Ball by Ugur Ozdemir model was good; it detected all 30 tennis balls.

I then exported the Tennis Ball by Ugur Ozdemir dataset in a YoloV8 compatible format so I could use it on the Ultralytics Hub service with my Ultralytics Pro plan to train a model.

This post is not about how to create a “good” dataset; it is the approach I took to create a “proof of concept” dataset for a demonstration.

devMobile's blog

Random wanderings through Microsoft Azure esp. PaaS plumbing, the IoT bits, AI on Micro controllers, AI on Edge Devices, .NET nanoFramework, .NET Core on *nix and ML.NET+ONNX

Tag Archives: ultralytics