YoloV8 ONNX – Nvidia Jetson Orin Nano™ DenseTensor Performance

When running the YoloV8 Coprocessor demonstration on the Nividia Jetson Orin inferencing looked a bit odd, the dotted line wasn’t moving as fast as expected. To investigate this further I split the inferencing duration into pre-processing, inferencing and post-processing times. Inferencing and post-processing were “quick”, but pre-processing was taking longer than expected.

YoloV8 Coprocessor application running on Nvidia Jetson Orin

When I ran the demonstration Ultralytics YoloV8 object detection console application on my development desktop (13th Gen Intel(R) Core(TM) i7-13700 2.10 GHz with 32.0 GB) the pre-processing was much faster.

The much shorter pre-processing and longer inferencing durations were not a surprise as my development desktop does not have a Graphics Processing Unit(GPU)

Test image used for testing on Jetson device and development PC

The test image taken with my mobile was 3606×2715 pixels which was representative of the security cameras images to be processed by the solution.

Redgate ANTS Performance Profiler instrumentation of application execution

On my development box running the application with Redgate ANTS Performance Profiler highlighted that the Computnet YoloV8 code converting the image to a DenseTensor could be an issue.

 public static void ProcessToTensor(Image<Rgb24> image, Size modelSize, bool originalAspectRatio, DenseTensor<float> target, int batch)
 {
    var options = new ResizeOptions()
    {
       Size = modelSize,
       Mode = originalAspectRatio ? ResizeMode.Max : ResizeMode.Stretch,
    };

    var xPadding = (modelSize.Width - image.Width) / 2;
    var yPadding = (modelSize.Height - image.Height) / 2;

    var width = image.Width;
    var height = image.Height;

    // Pre-calculate strides for performance
    var strideBatchR = target.Strides[0] * batch + target.Strides[1] * 0;
    var strideBatchG = target.Strides[0] * batch + target.Strides[1] * 1;
    var strideBatchB = target.Strides[0] * batch + target.Strides[1] * 2;
    var strideY = target.Strides[2];
    var strideX = target.Strides[3];

    // Get a span of the whole tensor for fast access
    var tensorSpan = target.Buffer;

    // Try get continuous memory block of the entire image data
    if (image.DangerousTryGetSinglePixelMemory(out var memory))
    {
       Parallel.For(0, width * height, index =>
       {
             int x = index % width;
             int y = index / width;
             int tensorIndex = strideBatchR + strideY * (y + yPadding) + strideX * (x + xPadding);

             var pixel = memory.Span[index];
             WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
       });
    }
    else
    {
       Parallel.For(0, height, y =>
       {
             var rowSpan = image.DangerousGetPixelRowMemory(y).Span;
             int tensorYIndex = strideBatchR + strideY * (y + yPadding);

             for (int x = 0; x < width; x++)
             {
                int tensorIndex = tensorYIndex + strideX * (x + xPadding);
                var pixel = rowSpan[x];
                WritePixel(tensorSpan.Span, tensorIndex, pixel, strideBatchR, strideBatchG, strideBatchB);
             }
       });
    }
 }

 private static void WritePixel(Span<float> tensorSpan, int tensorIndex, Rgb24 pixel, int strideBatchR, int strideBatchG, int strideBatchB)
 {
    tensorSpan[tensorIndex] = pixel.R / 255f;
    tensorSpan[tensorIndex + strideBatchG - strideBatchR] = pixel.G / 255f;
    tensorSpan[tensorIndex + strideBatchB - strideBatchR] = pixel.B / 255f;
 }

For a 3606×2715 image the WritePixel method would be called tens of millions of times so its implementation and the overall approach used for ProcessToTensor has a significant impact on performance.

YoloV8 Coprocessor application running on Nvidia Jetson Orin with a resized image

Resizing the images had a significant impact on performance on the development box and Nividia Jetson Orin. This will need some investigation to see how much reducing the resizing the images impacts on the performance and accuracy of the model.

The ProcessToTensor method has already had some performance optimisations which improved performance by roughly 20%. There have been discussions about optimising similar code e.g. Efficient Bitmap to OnnxRuntime Tensor in C#, and Efficient RGB Image to Tensor in dotnet which look applicable and these will be evaluated.

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.