Seeedstudio XIAO ESP32 S3 RS-485 test harness (nanoFramework)

As part of a project to read values from a MODBUS RS-485 sensor using an RS-485 Breakout Board for Seeed Studio XIAO and a Seeed Studio XIAO ESP32-S3, I built a .NET nanoFramework version of the Arduino test harness described in this wiki post.

This took a bit longer than I expected, mainly because running two instances of Visual Studio at the same time was a problem (Visual Studio 2022 for one device and Visual Studio 2026 for the other, though I'm not 100% confident this was the issue) as there were some weird interactions.

Using nanoff to flash a device with the latest version of ESP32_S3_ALL_UART
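For reference, the flashing command looked something like this (the COM port is an example, and as noted below it changed regularly):

nanoff --update --target ESP32_S3_ALL_UART --serialport COM7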

As I moved between the Arduino tooling and flashing devices with nanoff, the serial port numbers would change, so watching the port assignments in Windows Device Manager was key.

Windows Device Manager displaying the available serial ports

Rather than debugging both the nanoFramework RS485Sender and RS485Receiver applications simultaneously, I used the Arduino RS485Sender and RS485Receiver applications, but had similar issues with the port assignments changing.

Arduino RS485 Sender application
The nanoFramework sender application
using System;
using System.Diagnostics;
using System.IO.Ports;
using System.Threading;

using nanoFramework.Hardware.Esp32;

public class Program
{
   static SerialPort _serialDevice;

   public static void Main()
   {
      Configuration.SetPinFunction(Gpio.IO06, DeviceFunction.COM2_RX);
      Configuration.SetPinFunction(Gpio.IO05, DeviceFunction.COM2_TX);
      Configuration.SetPinFunction(Gpio.IO02, DeviceFunction.COM2_RTS);

      Debug.WriteLine("RS485 Sender: ");

      var ports = SerialPort.GetPortNames();

      Debug.WriteLine("Available ports: ");
      foreach (string port in ports)
      {
         Debug.WriteLine($" {port}");
      }

      _serialDevice = new SerialPort("COM2");
      _serialDevice.BaudRate = 9600;
      _serialDevice.Mode = SerialMode.RS485;

      _serialDevice.Open();

      Debug.WriteLine("Sending...");
      while (true)
      {
         string payload = $"{DateTime.UtcNow:HHmmss}";

         Debug.WriteLine($"Sent:{DateTime.UtcNow:HHmmss}");

         Debug.WriteLine(payload);

         _serialDevice.WriteLine(payload);

         Thread.Sleep(2000);
      }
   }
}

If I had built the nanoFramework RS485Sender and RS485Receiver applications first, debugging the Arduino RS485Sender and RS485Receiver would have been similar.

Arduino receiver application displaying messages from the nanoFramework sender application
The nanoFramework Receiver receiving messages from the nanoFramework Sender
using System;
using System.Diagnostics;
using System.IO.Ports;
using System.Threading;

using nanoFramework.Hardware.Esp32;

public class Program
{
   static SerialPort _serialDevice;
 
   public static void Main()
   {
      Configuration.SetPinFunction(Gpio.IO06, DeviceFunction.COM2_RX);
      Configuration.SetPinFunction(Gpio.IO05, DeviceFunction.COM2_TX);
      Configuration.SetPinFunction(Gpio.IO02, DeviceFunction.COM2_RTS);

      Debug.WriteLine("RS485 Receiver ");

      // get available ports
      var ports = SerialPort.GetPortNames();

      Debug.WriteLine("Available ports: ");
      foreach (string port in ports)
      {
         Debug.WriteLine($" {port}");
      }

      // set parameters
      _serialDevice = new SerialPort("COM2");
      _serialDevice.BaudRate = 9600;
      _serialDevice.Mode = SerialMode.RS485;

      // set a watch char to be notified when it's available in the input stream
      // (the sender's WriteLine call appends the '\n' terminator)
      _serialDevice.WatchChar = '\n';

      // setup an event handler that will fire when a char is received in the serial device input stream
      _serialDevice.DataReceived += SerialDevice_DataReceived;

      _serialDevice.Open();

      Debug.WriteLine("Waiting...");
      Thread.Sleep(Timeout.Infinite);
   }

   private static void SerialDevice_DataReceived(object sender, SerialDataReceivedEventArgs e)
   {
      SerialPort serialDevice = (SerialPort)sender;

      switch (e.EventType)
      {
         case SerialData.Chars: // deliberate fall-through - handled the same as WatchChar

         case SerialData.WatchChar:
            string response = serialDevice.ReadExisting();
            Debug.Write($"Received:{response}");
            break;
         default:
            Debug.Assert(false, $"e.EventType {e.EventType} unknown");
            break;
      }
   }
}

The serial port numbers changing while running different combinations of the Arduino and nanoFramework environments concurrently, combined with the sender and receiver applications having to be deployed to the right devices (and initially, accidentally, different baud rates), was a world of pain. With the benefit of hindsight I should have used two computers.

Azure Event Grid nanoFramework Client – Publisher

Building a .NET nanoFramework application for testing Azure Event Grid MQTT Broker connectivity that would run on my Seeedstudio EdgeBox ESP100 and Seeedstudio XIAO ESP32S3 devices took a couple of hours. Most of that time was spent figuring out how to generate the certificate and elliptic curve private key.

Create an elliptic curve private key

openssl ecparam -name prime256v1 -genkey -noout -out device.key

Generate a certificate signing request

openssl req -new -key device.key -out device.csr -subj "/CN=device.example.com/O=YourOrg/OU=IoT"

Then use the intermediate certificate and key file from earlier to generate a device certificate and key.

openssl x509 -req -in device.csr -CA IntermediateCA.crt -CAkey IntermediateCA.key -CAcreateserial -out device.crt -days 365 -sha256
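The device certificate and the intermediate certificate can then be concatenated (file names follow the openssl steps above) to produce the client certificate chain used below, and the generated certificate inspected.

cat device.crt IntermediateCA.crt > deviceChain.pem
openssl x509 -in device.crt -text -noout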

In this post I have assumed that the reader is familiar with configuring Azure Event Grid clients, client groups, topic spaces, permission bindings and routing.

The PEM encoded root CA certificate chain that is used to validate the server
public const string CA_ROOT_PEM = @"-----BEGIN CERTIFICATE-----
CN = Microsoft Azure ECC TLS Issuing CA 03
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
CN = DigiCert Global Root G3
-----END CERTIFICATE-----";

The PEM encoded certificate chain that is used to authenticate the device
public const string CLIENT_CERT_PEM_A = @"-----BEGIN CERTIFICATE-----
 CN=Self signed device certificate
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
 CN=Self signed Intermediate certificate
-----END CERTIFICATE-----";

The PEM encoded private key of the device
public const string CLIENT_KEY_PEM_A = @"-----BEGIN EC PRIVATE KEY-----
-----END EC PRIVATE KEY-----";

My application was “inspired” by the .NET nanoFramework m2mqtt example.

public static void Main()
{
   int sequenceNumber = 0;
   MqttClient mqttClient = null;
   Thread.Sleep(1000); // Found this works around some issues with running immediately after a reset

   bool wifiConnected = false;
   Console.WriteLine("WiFi connecting...");
   do
   {
      // Attempt to connect using DHCP
      wifiConnected = WifiNetworkHelper.ConnectDhcp(Secrets.WIFI_SSID, Secrets.WIFI_PASSWORD, requiresDateTime: true);

      if (!wifiConnected)
      {
         Console.WriteLine($"Failed to connect. Error: {WifiNetworkHelper.Status}");
         if (WifiNetworkHelper.HelperException != null)
         {
            Console.WriteLine($"Exception: {WifiNetworkHelper.HelperException}");
         }

         Thread.Sleep(1000);
      }
   }
   while (!wifiConnected);
   Console.WriteLine("WiFi connected");

   var caCert = new X509Certificate(Constants.CA_ROOT_PEM);

   X509Certificate2 clientCert = null;
   try
   {
      clientCert = new X509Certificate2(Secrets.CLIENT_CERT_PEM_A, Secrets.CLIENT_KEY_PEM_A, string.Empty);
   }
   catch (Exception ex)
   {
      Console.WriteLine($"Client Certificate Exception: {ex.Message}");
   }

   mqttClient = new MqttClient(Secrets.MQTT_SERVER, Constants.MQTT_PORT, true, caCert, clientCert, MqttSslProtocols.TLSv1_2);

   mqttClient.ProtocolVersion = MqttProtocolVersion.Version_5;

   bool mqttConnected = false;
   Console.WriteLine("MQTT connecting...");
   do
   {
      try
      {
         // Regular connect
         var resultConnect = mqttClient.Connect(Secrets.MQTT_CLIENTID, Secrets.MQTT_USERNAME, Secrets.MQTT_PASSWORD);
         if (resultConnect != MqttReasonCode.Success)
         {
            Console.WriteLine($"MQTT ERROR connecting: {resultConnect}");
            Thread.Sleep(1000);
         }
         else
         {
            mqttConnected = true;
         }
      }
      catch (Exception ex)
      {
         Console.WriteLine($"MQTT ERROR Exception '{ex.Message}'");
         Thread.Sleep(1000);
      }
   }
   while (!mqttConnected);
   Console.WriteLine("MQTT connected...");

   mqttClient.MqttMsgPublishReceived += MqttMsgPublishReceived;
   mqttClient.MqttMsgSubscribed += MqttMsgSubscribed;
   mqttClient.MqttMsgUnsubscribed += MqttMsgUnsubscribed;
   mqttClient.ConnectionOpened += ConnectionOpened;
   mqttClient.ConnectionClosed += ConnectionClosed;
   mqttClient.ConnectionClosedRequest += ConnectionClosedRequest;

   string topicPublish = string.Format(MQTT_TOPIC_PUBLISH_FORMAT, Secrets.MQTT_CLIENTID);
   while (true)
   {
      Console.WriteLine("MQTT publish message start...");

      var payload = new MessagePayload() { ClientID = Secrets.MQTT_CLIENTID, Sequence = sequenceNumber++ };

      string jsonPayload = JsonSerializer.SerializeObject(payload);

      var result = mqttClient.Publish(topicPublish, Encoding.UTF8.GetBytes(jsonPayload), "application/json; charset=utf-8", null);

      Debug.WriteLine($"MQTT published ({result}): {jsonPayload}");

      Thread.Sleep(100);
   }
}
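The MessagePayload class and the MQTT_TOPIC_PUBLISH_FORMAT constant referenced above are not shown in the snippet; a minimal sketch of what they could look like (the property names match the code above, the topic format string is an assumption and has to match the Azure Event Grid topic space configuration):

public class MessagePayload
{
   public string ClientID { get; set; }
   public int Sequence { get; set; }
}

// Assumption: publish topic must be covered by a topic space the client can publish to
private const string MQTT_TOPIC_PUBLISH_FORMAT = "devices/{0}/telemetry";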

I then configured my client (Edgebox100Z) and updated the "secrets.cs" file.

Azure Event Grid MQTT Broker Clients

The application connected to the Azure Event Grid MQTT broker and started publishing the JSON payload with the incrementing sequence number.

Visual Studio debugger output of JSON payload publishing

The published messages were “routed” to an Azure Storage Queue where they could be inspected with a tool like Azure Storage Explorer.

Azure Event Grid MQTT Broker metrics with messages published selected

I could see the application was working in the Azure Event Grid MQTT broker metrics because the number of messages published was increasing.

Seeedstudio XIAO ESP32 S3 RS-485 test harness (Arduino)

As part of a project to read values from a MODBUS RS-485 sensor using an RS-485 Breakout Board for Seeed Studio XIAO and a Seeed Studio XIAO ESP32-S3, I built the test harness described in the wiki post. The test harness setup for a Seeed Studio XIAO ESP32-C3/Seeed Studio XIAO ESP32-C6 didn't work with my Seeed Studio XIAO ESP32-S3.

I then did some digging, looked at schematics, and figured out the port mappings were different. This took a while, so I tried Microsoft Copilot.

I then updated the port assignments for my RS485Sender application.

#include <HardwareSerial.h>

HardwareSerial RS485(1);  // Use UART1

#define enable_pin D2

void setup() {
  Serial.begin(9600);  // Initialize the hardware serial with a baud rate of 9600
  delay(5000);

  Serial.println("RS485 Sender");

  // Wait for the hardware serial to be ready
  while (!Serial)
    ;
  Serial.println("!Serial done");

  //mySerial.begin(115200, SERIAL_8N1, 7, 6); // RX=D4(GPIO6), TX=D5(GPIO7) Doesn't work
  RS485.begin(115200, SERIAL_8N1, 6, 5);

  // Wait for the hardware serial to be ready
  while (!RS485)
    ;
  Serial.println("!RS485 done ");

  pinMode(enable_pin, OUTPUT);     // Set the enable pin as an output
  digitalWrite(enable_pin, HIGH);  // Set the enable pin high - transmit mode
}

void loop() {
  if (Serial.available()) {
    String inputData = Serial.readStringUntil('\n');  // Read the data from the hardware serial until a newline character

    // If the received data is not empty
    if (inputData.length() > 0) {
      Serial.println("Send successfully");  // Print a success message
      RS485.println(inputData);             // Send the received data out the RS485 port
    }
  }
}

I then updated the port assignments for my RS485Receiver application.

#include <HardwareSerial.h>

HardwareSerial RS485(1);  // Use UART1
#define enable_pin D2

void setup() {
  Serial.begin(9600);  // Initialize the hardware serial with a baud rate of 9600
  delay(5000);

  Serial.println("RS485 Receiver");

  // Wait for the hardware serial to be ready
  while (!Serial)
    ;
  Serial.println("!Serial done");

  // mySerial.begin(115200, SERIAL_8N1, 7, 6); // RX=D4(GPIO6), TX=D5(GPIO7) Doesn't seem to work
  RS485.begin(115200, SERIAL_8N1, 6, 5);

  // Wait for the hardware serial to be ready
  while (!RS485)
    ;
  Serial.println("!RS485 done ");

  pinMode(enable_pin, OUTPUT);    // Set the enable pin as an output
  digitalWrite(enable_pin, LOW);  // Set the enable pin low - receive mode
}

void loop() {
  // Check if there is data available from the hardware serial
  int x = RS485.available();

  if (x) {
    String response = RS485.readString();

    Serial.println(" RS485 Response: " + response);
  }

  delay(1000);
}

Getting my test harness RS485Sender and RS485Receiver applications (inspired by the Seeedstudio wiki) working took quite a bit longer than expected. Using Copilot worked better than expected, but I think that might be because my prompts were better after doing some research.

RTSP Camera Nager.VideoStream Startup Latency

While working on my RTSPCameraNagerVideoStream project I noticed that after opening the Real Time Streaming Protocol (RTSP) connection to my HiLook IPCT250H security camera it took a while for the application to start writing image files.

HiLook IPCT250H Camera configuration

My test harness code was “inspired” by the Nager.VideoStream.TestConsole application, with a slightly different file format for the start-stop marker text and camera image files.

private static async Task StartStreamProcessingAsync(InputSource inputSource, CancellationToken cancellationToken = default)
{
   Console.WriteLine("Start Stream Processing");
   try
   {
      var client = new VideoStreamClient();

      client.NewImageReceived += NewImageReceived;
#if FFMPEG_INFO_DISPLAY
      client.FFmpegInfoReceived += FFmpegInfoReceived;
#endif
      File.WriteAllText(Path.Combine(_applicationSettings.ImageFilepathLocal, $"{DateTime.UtcNow:yyyyMMdd-HHmmss.fff}.txt"), "Start");

      await client.StartFrameReaderAsync(inputSource, OutputImageFormat.Png, cancellationToken: cancellationToken);

      File.WriteAllText(Path.Combine(_applicationSettings.ImageFilepathLocal, $"{DateTime.UtcNow:yyyyMMdd-HHmmss.fff}.txt"), "Finish");

      client.NewImageReceived -= NewImageReceived;
#if FFMPEG_INFO_DISPLAY
      client.FFmpegInfoReceived -= FFmpegInfoReceived;
#endif
      Console.WriteLine("End Stream Processing");
   }
   catch (Exception exception)
   {
      Console.WriteLine($"{exception}");
   }
}

private static void NewImageReceived(byte[] imageData)
{
   Debug.WriteLine($"{DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} NewImageReceived");

   File.WriteAllBytes(Path.Combine(_applicationSettings.ImageFilepathLocal, $"{DateTime.UtcNow:yyyyMMdd-HHmmss.fff}.png"), imageData);
}

I used Path.Combine so no code or configuration changes were required when the application was run on different operating systems (though I still need to ensure ImageFilepathLocal in the appsettings.json is in the correct format).
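A minimal sketch of the relevant appsettings.json entry (only the setting used above is shown, and the path value is an example):

{
   "ApplicationSettings": {
      "ImageFilepathLocal": "Images"
   }
}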

Developer Desktop

I used my desktop computer, a 13th Gen Intel(R) Core(TM) i7-13700 @ 2.10 GHz with 32.0 GB of RAM, running Windows 11 Pro 24H2.

In the test results below (representative of multiple runs while testing) the delay between starting streaming and the first image file was on average 3.7 seconds, with the gap between images roughly 100mSec.

Files written by Nager.VideoStream with timestamps roughly 100mSec apart, but the first image roughly 3.7 seconds after starting

Industrial Computer

I used a reComputer J3011 – Edge AI Computer with NVIDIA® Jetson™ Orin™ Nano 8GB running Ubuntu 22.04.5 LTS (Jammy Jellyfish).

In the test results below (representative of multiple runs while testing) the delay between starting streaming and the first image file was on average roughly 3.7 seconds, but the time between images varied a lot, from 30mSec to >300mSec.

At 10FPS the results for my developer desktop were more consistent, and the reComputer J3011 had significantly more “jitter”. Both could cope with 10FPS, so the next step is to integrate the YoloDotNet library to process the video frames.

YoloV8 ONNX – Nvidia Jetson Orin Nano™ Execution Providers

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G, which can be used for inferencing with the Open Neural Network Exchange (ONNX) Runtime.

Story of Fail

Inferencing worked first time on the ARM64 CPU because the required runtime is included in the Microsoft.ML.OnnxRuntime NuGet.

ARM64 Linux ONNX runtime
Microsoft.ML.OnnxRuntime NuGet ARM64 Linux runtime

Inferencing failed on the Nvidia Jetson Orin 8G because the CUDA Execution Provider and TensorRT Execution Provider for the ONNXRuntime were not included in the Microsoft.ML.OnnxRuntime.Gpu.Linux NuGet.

Missing ARM64 Linux GPU runtime

There were Linux x64 and Windows x64 versions of the ONNXRuntime library included in the Microsoft.ML.OnnxRuntime.Gpu NuGet.

Microsoft.ML.OnnxRuntime.Gpu NuGet x64 Linux runtime

Desperately Seeking libonnxruntime.so

The Nvidia ONNX Runtime site had pip wheel files for the different versions of Python and the Open Neural Network Exchange (ONNX) Runtime.

The onnxruntime_gpu-1.18.0-cp312-cp312-linux_aarch64.whl matched the version of the ONNXRuntime I needed and the version of Python on the device.

When the pip wheel file was renamed to onnxruntime_gpu-1.18.0-cp312-cp312-linux_aarch64.zip it could be opened, but there wasn’t a libonnxruntime.so.

Onnxruntime_gpu-1.18.0-cp312-cp312-linux_aarch64 file listing

Building the TensorRT & CUDA Execution Providers

The ONNXRuntime build has to be done on the Nvidia Jetson Orin itself; after installing all the necessary prerequisites, the first attempt failed.

bryn@ubuntu:~/onnxruntime/onnxruntime$ ./build.sh --config Release --update --build --build_wheel \
--use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
--tensorrt_home /usr/lib/aarch64-linux-gnu

When in high power mode more cores are used, which consumes more resources when building the ONNXRuntime. To limit resource utilisation, --parallel 2 was added to the command line because the compile process was having “out of memory” failures.

bryn@ubuntu:~/onnxruntime/onnxruntime$ ./build.sh --config Release --update --build --parallel 2 --build_wheel \
--use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
--tensorrt_home /usr/lib/aarch64-linux-gnu

There were some compiler warnings but they appear to be benign.

The first attempt at running the application failed because libonnxruntime.so was missing, so --build_shared_lib was added to the command line.

2024-06-10 18:21:58,480 build [INFO] - Build complete
bryn@ubuntu:~/onnxruntime/onnxruntime$ ./build.sh --config Release --update --build --parallel 2 --build_wheel --use_tensorrt --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu --tensorrt_home /usr/lib/aarch64-linux-gnu --build_shared_lib

When the build completed, the files were copied to the runtime folder of the program.
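The build.sh output lands under build/Linux/Release, so the copy was something like this (the application’s publish folder is an example):

cp ~/onnxruntime/onnxruntime/build/Linux/Release/libonnxruntime*.so* ~/YoloV8.Coprocessor.Detect.Image/bin/Release/net8.0/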

The application could then be configured to use the TensorRT Execution Provider.

Getting CUDA and TensorRT working on the Nvidia Jetson Orin 8G took much longer than I expected, with many dead ends and device factory resets before the process was repeatable.

YoloV8 ONNX – Nvidia Jetson Orin Nano™ CPU & GPU TensorRT Inferencing

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G. To speed up TensorRT inferencing I built an Open Neural Network Exchange (ONNX) TensorRT Execution Provider. After updating the code to add a “warm-up” and tracking of average pre-processing, inferencing & post-processing durations, I did a series of CPU & GPU performance tests.

The testing consisted of permutations of three models, TennisBallsYoloV8s20240618640×640.onnx, TennisBallsYoloV8s2024062410241024.onnx & TennisBallsYoloV8x20240614640×640 (limited testing as it was slow), and three images, TennisBallsLandscape640x640.jpg, TennisBallsLandscape1024x1024.jpg & TennisBallsLandscape3072x4080.jpg.

Executive Summary

As expected, inferencing with a TensorRT 640×640 model and a 640×640 image was fastest: 9mSec pre-processing, 21mSec inferencing, then 4mSec post-processing.

If the image had to be scaled with SixLabors.ImageSharp this significantly increased the pre-processing (and overall) time.
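For context, the scaling step looks something like this with ImageSharp (the file name is from the test set above, the resize options are an assumption):

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;
using SixLabors.ImageSharp.Processing;

using (var image = Image.Load<Rgb24>("TennisBallsLandscape3072x4080.jpg"))
{
   // Scale the test image down to the 640x640 model input size, preserving aspect ratio
   image.Mutate(context => context.Resize(new ResizeOptions
   {
      Size = new Size(640, 640),
      Mode = ResizeMode.Max
   }));
}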

CPU Inferencing

GPU TensorRT Small model Inferencing

GPU TensorRT Large model Inferencing

Nvidia Jetson Orin Nano™ JetPack 6

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G coprocessor. Speeding up ML.NET running on the Nvidia Jetson Orin 8G required compatible versions of ML.NET, Open Neural Network Exchange (ONNX) Runtime, and NVIDIA JetPack.

Before installing NVIDIA JetPack 6 the Seeedstudio reComputer J3011 Edge AI Device has to be put into recovery mode.

Seeedstudio reComputer J3011 Edge AI Device with jumper for recovery mode

When started in recovery mode the Seeedstudio J3011 was in the list of Universal Serial Bus (USB) devices returned by lsusb.
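A quick way to check is filtering the lsusb output (in recovery mode the module shows up as an NVIDIA APX device):

lsusb | grep -i nvidia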

Upgrading to JetPack 5.1.1, so that the device could then be upgraded from a Windows Subsystem for Linux terminal, failed. The NVIDIA SDK Manager downloads and installs all the required components and dependencies.

Installing NVIDIA JetPack 6 from the Windows Subsystem for Linux failed because the version of Ubuntu installed (Ubuntu 24.04 LTS) was not supported by the NVIDIA SDK Manager.

Installing NVIDIA JetPack 6 from a desktop PC running Ubuntu 24.04 LTS failed (the same issue as above) because the NVIDIA SDK Manager did not support that version of Ubuntu. The desktop PC was then “re-paved” with Ubuntu 22.04 LTS and the NVIDIA SDK Manager worked.

An NVIDIA Developer program login is required to launch the NVIDIA SDK Manager.

Selecting the right target hardware is important if it is not “auto detected”.

The Open Neural Network Exchange (ONNX) Runtime supports the Compute Unified Device Architecture (CUDA), which has to be included in the installation package.

Downloading NVIDIA JetPack 6 and all the selected components of the install can be quite slow.

Installation of NVIDIA JetPack 6 and selected components can take a while.

Even though JetPack 6 is now available for Seeed’s Jetson Orin devices, this process is still applicable for an upgrade or “factory reset”.

YoloV8 ONNX – Nvidia Jetson Orin Nano™ DenseTensor Performance

When running the YoloV8 Coprocessor demonstration on the Nvidia Jetson Orin, inferencing looked a bit odd: the dotted line wasn’t moving as fast as expected. To investigate this further I split the inferencing duration into pre-processing, inferencing and post-processing times. Inferencing and post-processing were “quick”, but pre-processing was taking longer than expected.

YoloV8 Coprocessor application running on Nvidia Jetson Orin

When I ran the demonstration Ultralytics YoloV8 object detection console application on my development desktop (13th Gen Intel(R) Core(TM) i7-13700 @ 2.10 GHz with 32.0 GB of RAM) the pre-processing was much faster.

The much shorter pre-processing and longer inferencing durations were not a surprise, as my development desktop does not have a Graphics Processing Unit (GPU).

Test image used for testing on Jetson device and development PC

The test image taken with my mobile was 3606×2715 pixels, which was representative of the security camera images to be processed by the solution.

Redgate ANTS Performance Profiler instrumentation of application execution

On my development box, running the application with the Redgate ANTS Performance Profiler highlighted that the Compunet YoloV8 code converting the image to a DenseTensor could be an issue.

 public static void ProcessToTensor(Image<Rgb24> image, Size modelSize, bool originalAspectRatio, DenseTensor<float> target, int batch)
 {
    var options = new ResizeOptions()
    {
       Size = modelSize,
       Mode = originalAspectRatio ? ResizeMode.Max : ResizeMode.Stretch,
    };

    var xPadding = (modelSize.Width - image.Width) / 2;
    var yPadding = (modelSize.Height - image.Height) / 2;

    var width = image.Width;
    var height = image.Height;

    // Pre-calculate strides for performance
    var strideBatchR = target.Strides[0] * batch + target.Strides[1] * 0;
    var strideBatchG = target.Strides[0] * batch + target.Strides[1] * 1;
    var strideBatchB = target.Strides[0] * batch + target.Strides[1] * 2;
    var strideY = target.Strides[2];
    var strideX = target.Strides[3];

    // Get a span of the whole tensor for fast access
    var tensorSpan = target.Buffer;

    // Try get continuous memory block of the entire image data
    if (image.DangerousTryGetSinglePixelMemory(out var memory))
    {
       Parallel.For(0, width * height, index =>
       {
             int x = index % width;
             int y = index / width;
             int tensorIndex = strideBatchR + strideY * (y + yPadding) + strideX * (x + xPadding);

             var pixel = memory.Span[index];
             WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
       });
    }
    else
    {
       Parallel.For(0, height, y =>
       {
             var rowSpan = image.DangerousGetPixelRowMemory(y).Span;
             int tensorYIndex = strideBatchR + strideY * (y + yPadding);

             for (int x = 0; x < width; x++)
             {
                int tensorIndex = tensorYIndex + strideX * (x + xPadding);
                var pixel = rowSpan[x];
                WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
             }
       });
    }
 }

 private static void WritePixel(Span<float> tensorSpan, int tensorIndex, Rgb24 pixel, int strideBatchR, int strideBatchG, int strideBatchB)
 {
    tensorSpan[tensorIndex] = pixel.R / 255f;
    tensorSpan[tensorIndex + strideBatchG - strideBatchR] = pixel.G / 255f;
    tensorSpan[tensorIndex + strideBatchB - strideBatchR] = pixel.B / 255f;
 }

For a 3606×2715 image (roughly 9.8 million pixels) the WritePixel method is called once per pixel, so its implementation and the overall approach used for ProcessToTensor have a significant impact on performance.

YoloV8 Coprocessor application running on Nvidia Jetson Orin with a resized image

Resizing the images had a significant impact on performance on both the development box and the Nvidia Jetson Orin. This will need some investigation to see how much resizing the images impacts the performance and accuracy of the model.

The ProcessToTensor method has already had some performance optimisations, which improved performance by roughly 20%. There have been discussions about optimising similar code, e.g. Efficient Bitmap to OnnxRuntime Tensor in C# and Efficient RGB Image to Tensor in dotnet, which look applicable, and these will be evaluated.
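The common theme of those discussions is writing each channel plane with row-sized spans rather than computing a tensor index per pixel. A minimal sketch of the idea, assuming a [1, 3, height, width] NCHW tensor and an image already resized to exactly the model input size so there is no padding; this is not the library’s implementation:

using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime.Tensors;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;

public static class TensorPreprocessor
{
   // Sketch only - assumes an NCHW DenseTensor<float> sized [1, 3, height, width]
   public static void ProcessToTensorPlanar(Image<Rgb24> image, DenseTensor<float> target)
   {
      int width = image.Width;
      int height = image.Height;
      int planeSize = width * height;

      Parallel.For(0, height, y =>
      {
         var rowSpan = image.DangerousGetPixelRowMemory(y).Span;
         var tensorSpan = target.Buffer.Span;

         // One contiguous slice per channel per row avoids the per-pixel stride arithmetic
         var r = tensorSpan.Slice(0 * planeSize + y * width, width);
         var g = tensorSpan.Slice(1 * planeSize + y * width, width);
         var b = tensorSpan.Slice(2 * planeSize + y * width, width);

         for (int x = 0; x < width; x++)
         {
            Rgb24 pixel = rowSpan[x];
            r[x] = pixel.R / 255f;
            g[x] = pixel.G / 255f;
            b[x] = pixel.B / 255f;
         }
      });
   }
}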

YoloV8 ONNX – Nvidia Jetson Orin Nano™ GPU TensorRT Inferencing

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G. To speed up inferencing on the Nvidia Jetson Orin 8G with TensorRT I built an Open Neural Network Exchange (ONNX) TensorRT Execution Provider.

Roboflow Universe Tennis Ball dataset by Ugur ozdemir

The Open Neural Network Exchange (ONNX) model used was trained on the Roboflow Universe Tennis Ball dataset by Ugur ozdemir, which has 23,696 images. The initial version of the TensorRT integration used the builder.UseTensorrt method of the IYoloV8Builder interface.

...
YoloV8Builder builder = new YoloV8Builder();

builder.UseOnnxModel(_applicationSettings.ModelPath);

if (_applicationSettings.UseTensorrt)
{
   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Using TensorRT");

   builder.UseTensorrt(_applicationSettings.DeviceId);
}
...

When the YoloV8.Coprocessor.Detect.Image application was configured to use the NVIDIA TensorRT Execution Provider the average inference time was 58mSec, but it took roughly 7 minutes to build and optimise the engine each time the application was run.

Generating the TensorRT engine every time the application is started

The TensorRT Execution Provider has a number of configuration options, but the IYoloV8Builder interface had to be modified, with UseCuda, UseRocm, UseTensorrt and UseTvm overloads implemented to allow additional configuration settings.

...
public class YoloV8Builder : IYoloV8Builder
{
...
    public IYoloV8Builder UseOnnxModel(BinarySelector model)
    {
        _model = model;

        return this;
    }

#if GPURELEASE
    public IYoloV8Builder UseCuda(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithCudaProvider(deviceId));

    public IYoloV8Builder UseCuda(OrtCUDAProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithCudaProvider(options));

    public IYoloV8Builder UseRocm(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithRocmProvider(deviceId));
    
    // Couldn't test this - don't have suitable hardware
    public IYoloV8Builder UseRocm(OrtROCMProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithRocmProvider(options));

    public IYoloV8Builder UseTensorrt(int deviceId) => WithSessionOptions(SessionOptions.MakeSessionOptionWithTensorrtProvider(deviceId));

    public IYoloV8Builder UseTensorrt(OrtTensorRTProviderOptions options) => WithSessionOptions(SessionOptions.MakeSessionOptionWithTensorrtProvider(options));

    // Couldn't test this - don't have suitable hardware
    public IYoloV8Builder UseTvm(string settings = "") => WithSessionOptions(SessionOptions.MakeSessionOptionWithTvmProvider(settings));
#endif
...
}

The trt_engine_cache_enable and trt_engine_cache_path TensorRT Execution Provider session options configured the engine to be cached when it is built for the first time, so that when a new inference session is created the engine can be loaded directly from disk.

...
YoloV8Builder builder = new YoloV8Builder();

builder.UseOnnxModel(_applicationSettings.ModelPath);

if (_applicationSettings.UseTensorrt)
{
   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Using TensorRT");

   OrtTensorRTProviderOptions tensorRToptions = new OrtTensorRTProviderOptions();

   Dictionary<string, string> optionKeyValuePairs = new Dictionary<string, string>();

   optionKeyValuePairs.Add("trt_engine_cache_enable", "1");
   optionKeyValuePairs.Add("trt_engine_cache_path", "enginecache/");

   tensorRToptions.UpdateOptions(optionKeyValuePairs);

   builder.UseTensorrt(tensorRToptions);
}
...

In order to validate that the engine loaded from the trt_engine_cache_path is usable for the current inference, an engine profile is also cached and loaded along with the engine.

If the current input shapes are in the range of the engine profile, the loaded engine can be safely used. If the input shapes are out of range, the profile will be updated and the engine will be recreated based on the new profile.

Reusing the TensorRT engine built the first time the application is started

When the YoloV8.Coprocessor.Detect.Image application was configured to use NVIDIA TensorRT and the engine was cached, the average inference time was 58mSec and the Build method took roughly 10sec to execute after the application had been run once.

trtexec console application output

The trtexec utility can “pre-generate” engines, but there doesn’t appear to be a way to use them with the TensorRT Execution Provider.
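For reference, pre-generating an engine with trtexec looks something like this (trtexec ships with JetPack under /usr/src/tensorrt/bin, and the file names are placeholders):

/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --saveEngine=model.engine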

YoloV8 ONNX – Nvidia Jetson Orin Nano™ GPU CUDA Inferencing

The Seeedstudio reComputer J3011 has two processors, an ARM64 CPU and an Nvidia Jetson Orin 8G. To speed up inferencing on the Nvidia Jetson Orin 8G with the Compute Unified Device Architecture (CUDA) I built an Open Neural Network Exchange (ONNX) CUDA Execution Provider.

The Open Neural Network Exchange (ONNX) model used was trained on the Roboflow Universe Tennis Ball dataset by Ugur ozdemir, which has 23,696 images.

// load the app settings into configuration
var configuration = new ConfigurationBuilder()
      .AddJsonFile("appsettings.json", false, true)
      .Build();

_applicationSettings = configuration.GetSection("ApplicationSettings").Get<Model.ApplicationSettings>();

Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} YoloV8 Model load: {_applicationSettings.ModelPath}");

YoloV8Builder builder = new YoloV8Builder();

builder.UseOnnxModel(_applicationSettings.ModelPath);

if (_applicationSettings.UseCuda)
{
   builder.UseCuda(_applicationSettings.DeviceId);
}

if (_applicationSettings.UseTensorrt)
{
   builder.UseTensorrt(_applicationSettings.DeviceId);
}

/*
builder.WithConfiguration(c =>
{
});
*/

/*
builder.WithSessionOptions(new Microsoft.ML.OnnxRuntime.SessionOptions()
{

});
*/

using (var image = await SixLabors.ImageSharp.Image.LoadAsync<Rgba32>(_applicationSettings.ImageInputPath))
using (var predictor = builder.Build())
{
   var result = await predictor.DetectAsync(image);

   Console.WriteLine();
   Console.WriteLine($"Speed: {result.Speed}");
   Console.WriteLine();

   foreach (var prediction in result.Boxes)
   {
      Console.WriteLine($" Class {prediction.Class} {(prediction.Confidence * 100.0):f1}% X:{prediction.Bounds.X} Y:{prediction.Bounds.Y} Width:{prediction.Bounds.Width} Height:{prediction.Bounds.Height}");
   }

   Console.WriteLine();

   Console.WriteLine($" {DateTime.UtcNow:yy-MM-dd HH:mm:ss.fff} Plot and save : {_applicationSettings.ImageOutputPath}");

   using (var imageOutput = await result.PlotImageAsync(image))
   {
      await imageOutput.SaveAsJpegAsync(_applicationSettings.ImageOutputPath);
   }
}

When configured to run the YoloV8.Coprocessor.Detect.Image on the ARM64 CPU the average inference time was 729mSec.

The first time I ran the YoloV8.Coprocessor.Detect.Image application configured to use CUDA for inferencing, it failed badly.

The YoloV8.Coprocessor.Detect.Image application was then configured to use CUDA, and the average inferencing time was 85mSec.

It took a couple of weeks to get the YoloV8.Coprocessor.Detect.Image application inferencing on the Nvidia Jetson Orin 8G coprocessor; this will be covered in detail in another post.