Running reasoning LLMs like the DeepSeek-R1 70b 43G locally for private / offline / air-gapped / secure environments via Ollama

Michael O'Brien
39 min read · Jan 28, 2025


This article details one option for running LLM inference locally on your own private machine, whether that is a single GPU, a dual GPU setup, or CPU only. The focus is private, offline and air-gapped LLM inference in secure environments. No online LLMs are used except the commercial OpenAI o1 pro and Google Gemini Advanced models, which are included for comparison.

I spent the weekend running the new DeepSeek R1 70b model that arrived last week so I could compare its output with other models such as OpenAI o1 pro, Qwen2, Meta Llama 3.3, Mistral Large 123b and Google Gemma 2 27B on my local NVIDIA and M4 Max GPUs. I have included some detailed comparisons against the current state-of-the-art OpenAI o1 pro ($200/month) model on a couple of shallow and deep single requests.

Bill of Materials

If you only wish to run inference on the model, I would use Ollama, the llama.cpp wrapper I reviewed in the Feb 2024 article.

For fine-tuning and customization, run the models via TensorFlow 2 or llama.cpp and download the models directly from Hugging Face using your token. For inference only, or for quick-start testing, just use Ollama.
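If you do take the Hugging Face route for fine-tuning, a minimal sketch of the CLI download (the repository id and target directory below are assumptions; verify the exact model name on Hugging Face) looks like this:

pip install -U "huggingface_hub[cli]"
huggingface-cli login                # paste your Hugging Face access token
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B --local-dir ./deepseek-r1-70b

For the Ollama inference path none of this is needed; ollama pull fetches a quantized GGUF build for you.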

Note the performance tradeoffs across various GPU types, counts and configurations in the April 2024 article.

Using Ollama is very straightforward, but you will want some serious VRAM (NVIDIA on Windows) or unified memory (Apple M4 Max on macOS). An alternative is to use CPU-based system RAM on Windows, but this will severely reduce tokens/sec.

The ideal setup for most models up to 70b is any single or dual GPU combination that provides 48G. This includes a workstation RTX A6000 Ampere or RTX 6000 Ada, dual RTX 4090 Ada consumer cards for Windows, or an M4 Max with at least 64G of RAM (by default roughly 75% of unified memory, 48G here, is available to the GPU). Note: the 43G deepseek-r1:70b will run on a stock 48G M4 Max (5120 GPU cores) MacBook Pro 16, but you will need to shut down everything else and leave only a few GB for the OS, as the Mac will use around 46G of the 48G while running the larger 70b model.
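A quick way to confirm how much of a loaded model actually landed in VRAM versus system memory is Ollama's process listing; run it in a second terminal while the model is loaded (the exact columns vary by Ollama version, and the figures below are illustrative):

ollama ps
# NAME               SIZE     PROCESSOR    UNTIL
# deepseek-r1:70b    ~45 GB   100% GPU     4 minutes from now

A PROCESSOR value such as 40%/60% CPU/GPU means part of the model has spilled to system RAM and tokens/sec will drop accordingly.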

The challenge is asking the “right” questions/prompts — as in engaging the reasoning capabilities of the model — and not just asking for essentially a search and summary. The quality of the prompt engineering is key.

Run the models using Ollama

Model Analysis

See ongoing details on running deepseek-r1 against other models at https://github.com/ObrienlabsDev/blog/issues/100

RTX-A6000 48G card running deepseek-r1:70b

Full saturation: 100% TDP, 300W, 104C, 45G VRAM

Use the --verbose flag when running to get stats. Downloading the model on a standard 1G line should take ~5 min. After the download, optionally disconnect the network.

ollama run deepseek-r1:70b --verbose
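For a truly air-gapped machine, one workable sketch (paths below are the Ollama defaults and can differ per install; OLLAMA_MODELS relocates the model store) is to pull the model on a connected machine and copy the model directory across:

# on the connected machine
ollama pull deepseek-r1:70b
# blobs land under ~/.ollama/models (Linux/macOS) or C:\Users\<you>\.ollama\models (Windows)
# copy that directory to the offline machine, or point both machines at a custom store:
export OLLAMA_MODELS=/secure/models    # set before starting ollama serve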

This is a typical RTX-A6000 setup

Ollama will efficiently saturate your GPU or dual GPU and mix in CPU cycles as required

total duration:       2m44.9175002s
load duration: 10.5717ms
prompt eval count: 68 token(s)
prompt eval duration: 6ms
prompt eval rate: 11333.33 tokens/s
eval count: 2349 token(s)
eval duration: 2m44.9s
eval rate: 14.24 tokens/s

ael@14900c MINGW64 ~
$ nvidia-smi
Sat Jan 25 19:05:56 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17 Driver Version: 561.17 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 WDDM | 00000000:01:00.0 Off | Off |
| 77% 86C P2 299W / 300W | 43980MiB / 49140MiB | 98% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 9388 C ...uda_v12_avx\ollama_llama_server.exe N/A |
+-----------------------------------------------------------------------------------------+
Dual RTX-4090 24+24=48G cards running deepseek-r1:70b

total duration:       2m16.3354264s
load duration: 10.2679ms
prompt eval count: 68 token(s)
prompt eval duration: 320ms
prompt eval rate: 212.50 tokens/s
eval count: 2433 token(s)
eval duration: 2m15.999s
eval rate: 17.89 tokens/s
michael@13900b MINGW64 ~
$ nvidia-smi
Sat Jan 25 18:45:53 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17 Driver Version: 561.17 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 30% 44C P2 221W / 480W | 21230MiB / 24564MiB | 47% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 30% 41C P2 223W / 480W | 21811MiB / 24564MiB | 44% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 7780 C ...uda_v12_avx\ollama_llama_server.exe N/A |

Both GPUs are needed (no NVLink here, so we use the ~16 GB/s PCIe bus instead of 112 GB/s NVLink): 20 + 20 = 40G of VRAM for the 70b model.

50% GPU saturation and 15-25% TDP is common for non-optimized 16/16 or 32/32 grids; here, however, the limiter is the PCIe bus sharing the model's VRAM across the two GPUs, at roughly 50% memory bus load.
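To see whether a dual-GPU run is SM-bound or bus-bound, one rough check is nvidia-smi's device monitor, which prints per-GPU power, SM/memory utilization and PCIe throughput about once per second (the flag set shown is just one reasonable selection):

nvidia-smi dmon -s put
# p = power/temperature, u = SM/memory utilization, t = PCIe rx/tx throughput

Low SM utilization combined with sustained PCIe traffic on both cards is the signature of the shared-model-VRAM case described above.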

MacBook Pro 16 M4 Max 40 SM (5120 cores) 48G unified RAM — deepseek-r1:70b

Ideally you will need at least 64G to run the 70b on a Mac (use the 32b for normal operations); however, the 43G 70b is able to run on a 48G Mac.

>ollama run deepseek-r1:70b --verbose

pulling manifest
pulling 4cd576d9aa16… 100% ▕████████████████▏  42 GB
pulling 369ca498f347… 100% ▕████████████████▏ 387 B
pulling 6e4c38e1172f… 100% ▕████████████████▏ 1.1 KB
pulling f4d24e9138dd… 100% ▕████████████████▏ 148 B
pulling 5e9a45d7d8b9… 100% ▕████████████████▏ 488 B
verifying sha256 digest
writing manifest
success

Shallow prompt

>>> Can we go over creating a mandelbrot image using cuda c. I would like to fully utilize all the cores of the GPU. I am using specific GPUs from NVidia including the RTX-3500 Ada mobile with 5120 cores and the RTX-4090 with 16384 cores.

Answer

Okay, so I want to create a Mandelbrot image using CUDA C to fully utilize all the GPU cores on my RTX 3500 Ada (which has 5120 cores) and RTX 4090 (with 16384 cores). I’ve heard that GPUs are great for parallel processing, but I’m not exactly sure how to approach this.

First, I remember that the Mandelbrot set is calculated using a recursive formula. Each pixel in the image corresponds to a complex number c = x + yi. We iterate z = z² + c and check if |z| exceeds 2; if it
does, the point escapes the set. The number of iterations before escaping determines the color.

So, to parallelize this on the GPU, each pixel’s computation should be handled by a separate thread because GPUs have thousands of cores. That way, all pixels can be processed simultaneously, which should speed
things up a lot compared to doing it on the CPU.

I need to structure my CUDA kernel so that each thread calculates the number of iterations for its assigned pixel. Let me think about how to map 2D pixel coordinates to threads. I’ll probably use a 2D grid of
blocks and within each block, have 2D threads. So, in the kernel launch parameters, I’ll specify something like dim3(grid, block) where grid is based on the image dimensions divided by the block size.

Choosing the right block size is important for occupancy. Maybe starting with 16x16 or 32x32 blocks and see how it performs. Each thread will calculate its x and y coordinates using blockIdx and threadIdx
variables.

I should also consider using shared memory to optimize data access, but in this case, since each pixel’s computation is independent, maybe it’s not necessary unless I’m accessing some read-only data frequently.

Now, the kernel function itself. It needs to take parameters like image dimensions (width and height), the center coordinates (offset) to zoom into a specific area, and the zoom level. Then for each thread,
calculate x0 and y0 based on their position relative to the offset and zoom.

The iteration loop will run until it either reaches the maximum number of iterations or the magnitude exceeds 2. Each iteration updates z_real and z_imag using the formula z = z² + c.

I’ll need to store the result (number of iterations) in a global memory array, which is then copied back to the host for rendering into an image file.

On the host side, I’ll initialize the parameters, allocate device memory for the result array, launch the kernel with appropriate grid and block sizes, handle any potential CUDA errors, copy the results back to
the host, and then generate the image.

I’m a bit concerned about memory management. I need to make sure that the device arrays are properly allocated and freed. Also, performance-wise, if each thread does independent calculations without much shared
data, it should scale well with more cores. However, using too large of a grid might cause some overhead, so finding the right balance in block size is key.

Testing on both GPUs could show different performance characteristics because the RTX 4090 has more cores. I wonder if the same kernel will automatically utilize all available cores or if I need to adjust block
sizes based on the GPU’s architecture.

Another thing to consider is data type precision. Using float should be sufficient for most cases, but for deep zooms, maybe using double would be better to maintain precision, though that might slow things down
and reduce occupancy since doubles take up more memory.

I also need to convert the iteration counts into colors. A simple approach is to map iterations to grayscale, but for a nicer image, perhaps a color palette based on iteration count modulo some number for
different hues. Or maybe just using a logarithmic scale for brightness.

Wait, I should make sure that when calculating x0 and y0, the scaling factors are correct so that each pixel corresponds accurately to the zoomed area. The offset parameters (center_x and center_y) will shift
the view, allowing me to focus on different parts of the set.

I think I’ll structure the code as follows:

  1. Host function initializes parameters: width, height, max iterations, center offsets, zoom level.
  2. Allocate host and device memory for the result array which holds iteration counts per pixel.
  3. Launch the kernel with a grid that covers all pixels, each thread handling one pixel.
  4. Each thread calculates its x0 and y0 based on global thread indices.
  5. Iterate z = z² + c until escape or max iterations reached.
  6. After kernel completes, copy results back to host.
  7. Generate image by mapping iteration counts to colors.

I need to handle CUDA errors properly, using functions like cudaGetLastError() after each API call and print any errors that occur. This will help debug issues if the kernel fails to launch or data doesn’t
transfer correctly.

Also, for timing, I might use CUDA events to measure how long the kernel takes on different GPUs, which can help in optimizing block sizes and understanding performance scaling with more cores.

In terms of fully utilizing all GPU cores, making sure that each thread is kept busy without too much idle time is crucial. That’s why choosing an appropriate grid and block configuration is important. Maybe
using a larger block size like 32x32 could be better than smaller ones as they provide more threads per block, which can help hide latency.

I also wonder about register usage. Each thread uses a few registers to hold z_real, z_imag, c_real, c_imag, etc. If too many registers are used, it might limit occupancy, so perhaps some variables could be
moved to shared memory if necessary, but in this case, since each thread’s calculations are independent, it might not help much.

Testing the code on both GPUs will show how well it scales. The RTX 4090 with more cores should compute the image faster than the RTX 3500 Ada, given similar block configurations.

I think I’ve covered most aspects now. Let me outline the steps in code.

To create a Mandelbrot set image using CUDA C and fully utilize all GPU cores on your NVidia GPUs (RTX-3500 Ada and RTX-4090), follow these structured steps:

1. Host Function Initialization

  • Define parameters: width, height of the image, maximum iterations for each pixel, center coordinates for zooming, and the zoom level.
  • Allocate memory on both host and device for storing iteration counts.

2. CUDA Kernel Launch

  • Set up a grid configuration where each thread handles one pixel. Use a 2D grid with blocks (e.g., 16x16 or 32x32) to cover all pixels efficiently.
  • Launch the kernel, ensuring it’s optimized for high occupancy on both GPU architectures.

3. Kernel Function

  • Each thread computes its assigned pixel’s coordinates using blockIdx and threadIdx.
  • Calculate x0 and y0 based on global thread indices, scaled by zoom level and offset to focus on specific areas.
  • Iterate z = z² + c until the magnitude exceeds 2 or max iterations are reached. Store the iteration count in a global memory array.

4. Data Transfer and Error Handling

  • After kernel execution, transfer results from device to host memory.
  • Use CUDA functions to check for errors during memory allocation, kernel launch, and data transfer.

5. Image Generation

  • Convert iteration counts into color values. You can use a grayscale map or more complex color palettes based on iteration modulo.
  • Save the resulting image as a bitmap or another format.

6. Performance Optimization

  • Adjust block sizes to maximize GPU core utilization. Consider using CUDA events for timing and optimizing grid configurations.
  • Ensure efficient memory management to prevent bottlenecks, focusing on minimizing global memory accesses.

7. Testing Across GPUs

  • Compare performance between RTX-3500 Ada (5120 cores) and RTX-4090 (16384 cores). Expect faster processing times on the RTX-4090 due to higher core count when using similar configurations.
  • Adjust block sizes if necessary for different architectures, though the same kernel should scale well without changes.

Sample CUDA C Code Outline

#include <cuda_runtime.h>
#include <device_launch_parameters.h>

// Kernel function
__global__
void mandelbrotKernel(int* d_result, int width, int height,
                      int max_iter, double center_x, double center_y, double zoom) {
    // Thread indices
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // Calculate c for this pixel
    double x0 = (x - width / 2.0) * zoom + center_x;
    double y0 = (y - height / 2.0) * zoom + center_y;
    // Complex number z and c
    double z_real = 0.0, z_imag = 0.0;
    double c_real = x0, c_imag = y0;
    int iter = 0;
    while (iter < max_iter && (z_real*z_real + z_imag*z_imag) <= 4.0) {
        // Compute next iteration
        double temp = z_real * z_real - z_imag * z_imag + c_real;
        z_imag = 2.0 * z_real * z_imag + c_imag;
        z_real = temp;
        iter++;
    }
    d_result[y * width + x] = iter;
}

int main() {
    // Image dimensions
    const int width = 1024, height = 768;
    const int max_iter = 255;
    // Memory allocation
    size_t bytes = width * height * sizeof(int);
    int* h_result; cudaMallocHost(&h_result, bytes);
    int* d_result; cudaMalloc(&d_result, bytes);
    // Grid and block configuration
    int blockSize = 16;
    dim3 blockSize2D(blockSize, blockSize, 1);
    dim3 gridSize2D(ceil(width / blockSize), ceil(height / blockSize), 1);
    // Kernel parameters
    double center_x = -0.5, center_y = 0.0;
    double zoom = 200.0;
    // Launch kernel
    mandelbrotKernel<<<gridSize2D, blockSize2D>>>(d_result, width, height, max_iter,
                                                  center_x, center_y, zoom);
    // Synchronize and check for errors
    cudaDeviceSynchronize();
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess) {
        printf("CUDA Error: %s\n", cudaGetErrorString(error));
    }
    // Transfer results to host
    cudaMemcpy(h_result, d_result, bytes, cudaMemcpyDeviceToHost);
    // Free memory
    cudaFree(d_result);
    cudaFreeHost(h_result);
    return 0;
}

Explanation

This code efficiently computes the Mandelbrot set using CUDA. Each pixel’s iteration count is calculated on the GPU, ensuring full utilization of all cores for optimal performance. The kernel uses block-based
parallelism to process pixels concurrently, leveraging the massive parallel processing capabilities of NVidia GPUs. The result is then transferred back to the host and can be saved as an image for visualization.

Advice

  • Memory Management: Use cudaMallocHost for pinned memory on the host for faster data transfer.
  • Error Checking: Implement error handling after each CUDA function call to catch issues early.
  • Performance Tuning: Experiment with different block sizes (e.g., 8x8, 16x16, 32x32) to find optimal performance for your specific GPU architecture.
  • Image Conversion: After obtaining the iteration counts, convert them into an image format using libraries like OpenCV or custom code.

By following these steps and optimizing as needed, you can create high-resolution Mandelbrot set images efficiently on both RTX-3500 Ada and RTX-4090 GPUs.

43G VRAM
total duration: 8m1.026498917s
load duration: 33.207792ms
prompt eval count: 69 token(s)
prompt eval duration: 1m26.968s
prompt eval rate: 0.79 tokens/s
eval count: 2510 token(s)
eval duration: 6m34.022s
eval rate: 6.37 tokens/s

Dual RTX-A4500 20G cards running the deepseek-r1:32b model

Note: 1G of CPU RAM is used only to run Ollama, not the model, which in this case (the 32b model) takes 11 + 12 = 23G of VRAM across the two RTX-A4500 20G GPUs
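The run itself follows the same pattern as the 70b examples above (model tag from the Ollama library):

ollama run deepseek-r1:32b --verbose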

Windows 13800H single RTX-3500 Ada 12G VRAM — deepseek-r1:14b

Run the 14b model on smaller mobile GPUs
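Same command pattern again, just the smaller distill (Ollama library tag):

ollama run deepseek-r1:14b --verbose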

9G
total duration: 44.1528736s
load duration: 11.0044ms
prompt eval count: 1286 token(s)
prompt eval duration: 1.185s
prompt eval rate: 1085.23 tokens/s
eval count: 1253 token(s)
eval duration: 42.934s
eval rate: 29.18 tokens/s

Adjacent LLM Models

According to DeepSeek, the Qwen model from Alibaba was part of their work. I ran the Qwen2 model to see what the larger 72b-instruct-fp16 145G model would output for comparison. This very large model will use some of the available 48G of VRAM, with 102G spilling into CPU RAM, and as expected it runs very slowly on the CPU.
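When a model does not fit in VRAM, Ollama chooses how many transformer layers to offload to the GPUs and keeps the rest in system RAM. You can influence the split with the num_gpu parameter; the layer count below is illustrative only and needs tuning per card:

ollama run qwen2:72b-instruct-fp16
>>> /set parameter num_gpu 40    # offload ~40 layers to the GPUs; the remainder stays in CPU RAM

Lowering num_gpu trades VRAM pressure for more of the slow CPU path, so it mainly helps when the default split runs out of memory.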

qwen2-72b-instruct-fp16 on dual RTX-4090 24+24=48G and 192GB CPU RAM

The 145G model runs in 102G of CPU RAM plus 20G and 22G of VRAM; it is expected to be very slow since we are working across the shared PCIe x8 bus.

$ nvidia-smi
Sun Jan 26 00:43:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 561.17 Driver Version: 561.17 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 WDDM | 00000000:01:00.0 Off | Off |
| 0% 49C P3 51W / 480W | 21600MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 WDDM | 00000000:02:00.0 On | Off |
| 0% 51C P3 50W / 480W | 20103MiB / 24564MiB | 22% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 7380 C ...uda_v12_avx\ollama_llama_server.exe N/A |

Appendix A:

OpenAI o1 pro $300CAD/m comparison

Prompt 1: shallow ask from above

Can we go over creating a mandelbrot image using cuda c. I would like to fully utilize all the cores of the GPU. I am using specific GPUs from NVidia including the RTX-3500 Ada mobile with 5120 cores and the RTX-4090 with 16384 cores.

thinking…4m 0s

Utilizing the full GPU potential

I’m exploring how to optimize a CUDA C program to leverage numerous GPU cores, like the RTX-3500 Ada and RTX-4090, for generating images such as Mandelbrot sets.

Crafting the iteration function

I’m detailing the creation of a function to iterate Mandelbrot set calculations for each pixel, scaling coordinates, applying color schemes based on iteration counts, and defining the iteration formula and limits.

Breaking down GPU tasks

I’m setting up a sample CUDA program, optimizing GPU resources with 2D block configurations, and preparing data for storage in a .ppm file.

Optimizing GPU usage

I’m focusing on parallel iteration computations, leveraging GPU cores efficiently, and exploring various optimization techniques and precision options to enhance fractal generation.

Prompt 2: deeper ask

Can we go over creating a mandelbrot image set using CUDA C — the 12.6 version for compute capability 8.6 to 8.9. I would like to fully utilize all the SM cores of the GPU. I am using specific GPUs from NVIDIA including the RTX-3500 Ada mobile with 5120 cores, the RTX-A6000, a dual RTX-A4500 with NVLink and a dual RTX-4090 with 32768 cores spread across both GPUs. I would like the FP precision to be as high as possible to allow for iteration of image generation down to very low levels of magnitude. If possible suggest optimizations for each specific card and compute capability so that the optimum FP precision is used whether FP16 or FP64. The iteration level should be customizable to be able to handle up to 24bit color precision. Suggest a method we can use to allow for persistence of the resulting images or raster data without impacting the image generation pipelining and throttling the GPU saturation — by minimizing file system writes or network based database storage. For reference the CPU bus is PCIe x16 for single GPUs and PCI x8 for dual gpu setups. The storage system is 1 to 5 GB/s NVMe drives. The primary C++ IDE is Visual Studio 2022.

Thinking….7m 6s

Detailing GPU and precision requirements

I’m noting the compute capability of 8.6 or 8.9 with CUDA 12.6, targeting GPUs like RTX-3500 and RTX-4090, aiming for maximum precision and efficient GPU pipeline use.

Innovative GPU strategies

I’m exploring asynchronous writes, pinned memory, and separate streams for data extraction while computing. Leveraging concurrency, streams, and multi-GPU configurations expands the options.

Balancing precision needs

I’m considering the A4500 and A6000 GPUs for double precision. Using CUDA, I’ll carefully manage concurrency, load data, run parallel iterations, and asynchronously transfer results.

Selecting precise methods

I’m planning to use a pinned memory approach for storing results in standard image formats. Handling multi-GPU setups with CUDA Multi-Device or NVLink will be key.

Output….

DeepSeek-R1:70b

Prompt 1: shallow ask

PS C:\Users\michael> ollama run deepseek-r1:70b --verbose
>>>
>>> Can we go over creating a mandelbrot image using cuda c. I would like to fully utilize all the cores o
… f the GPU. I am using specific GPUs from NVidia including the RTX-3500 Ada mobile with 5120 cores and
… the RTX-4090 with 16384 cores.
<think>
Alright, so I’m trying to create a Mandelbrot image using CUDA C. The user wants to fully utilize all
the GPU cores they have, specifically on an RTX 3500 Ada with 5120 cores and an RTX 4090 with 16384
cores. Hmm, okay, let me think through how to approach this.

First, I remember that the Mandelbrot set is calculated by iterating a function for each point in the
complex plane and checking if it diverges. Each pixel corresponds to a point, so we’ll need to
compute this for every pixel in the image. Using CUDA makes sense here because GPUs can parallelize
these computations across many cores.

So, I should start with the basic structure of a CUDA program. That includes initializing the device,
allocating memory on both host and device, and then transferring data between them. The main task is
to compute the Mandelbrot set for each pixel, which will be done in a kernel function.

Wait, how do I map pixels to threads? Each thread should handle one pixel, right? So, the kernel
needs to calculate the number of iterations for that specific pixel’s point. I’ll probably launch a
grid where each block has multiple threads — maybe 16x16 since that’s a common choice for CUDA. But I’m
not sure if that’s optimal for all GPUs, especially with varying core counts.

Let me outline the steps:

1. **Allocate Memory**: On both host (CPU) and device (GPU). The host will hold the final image data,
which is an array of floats or something similar representing each pixel’s iteration count.

2. **Launch Kernel**: The kernel function will compute the Mandelbrot iterations for each pixel. Each
thread in the kernel handles one pixel. So, I need to set up a grid where each block has enough
threads to cover all pixels efficiently.

3. **Memory Transfer**: After computation on GPU, transfer the results back to host memory and then
save it as an image file.

Now, considering performance: Since the user wants full utilization of GPU cores, I should maximize parallelism. Using a 2D grid might help, where each block covers a small part of the image. Maybe
16x16 blocks with each block having 256 threads or something like that? Wait, no — each thread is
responsible for one pixel. So if my image is, say, 1024x1024 pixels, I’d need 1,048,576 threads. But
how to arrange them into blocks and grids.

I think the standard approach is to have each block handle a square of pixels, like 16x16, which
gives 256 threads per block. Then, for a 1024x1024 image, you’d have (1024/16) x (1024/16) = 64x64
blocks in total. That’s 4096 blocks, each with 256 threads, totaling over a million threads, which
should keep the GPU busy.

But wait, different GPUs might handle block sizes differently. The RTX 4090 has more cores than the
3500 Ada, so maybe using larger blocks would be better for it? Or perhaps stick to a standard that
works well across both cards.

The kernel function itself needs to calculate the Mandelbrot iteration count for each pixel. Each
pixel’s (x,y) corresponds to a complex number c = x + yi. The algorithm iterates z = z² + c,
starting from z=0. If |z| exceeds 2, the point is in the set and we stop iterating.

I’ll have to map the pixel indices to the actual complex coordinates. That involves setting up the
boundaries — like the zoom level, center, and width. For example, if I’m rendering a view centered at
(centerX, centerY) with a certain width and height, each pixel’s position can be calculated
accordingly.

In the kernel function, each thread will compute its x and y positions based on its block and thread
indices. Then, it’ll map these to the complex plane coordinates and perform the iterations. The
result is stored in the global memory array at the corresponding pixel index.

I also need to optimize memory access patterns. Since threads in a block can cooperate, maybe using
shared memory could help, but I’m not sure if it’s necessary here because each thread is handling an
independent calculation. So perhaps just writing directly to global memory is fine as long as we have
coalesced accesses.

Wait, how are the pixel indices mapped? If each row is contiguous in memory, then consecutive threads
in a row should access consecutive memory addresses, which is good for coalescing. So arranging the
blocks and grid so that nearby threads handle nearby pixels would help with memory bandwidth.

After computing all pixels, I’ll transfer the data back to host, normalize it to an 8-bit grayscale
image (since higher iteration counts can be mapped to different colors), and then write it out as a
PPM or PNG file. For simplicity, maybe just a P6 PPM where each pixel is represented by three bytes
for RGB, but since it’s grayscale, all components are the same.

Wait, but in my initial plan, I’m storing iteration counts as floats. When writing to an image, I can
scale these values to 0–255. Alternatively, perhaps using a log scale might give better visual
results because the iteration counts grow exponentially. That might help in making the image look
more detailed and less flat.

Another consideration is precision. Using single-precision floats (float) should be sufficient for
most purposes, but if high zoom levels are needed, double precision might be necessary to avoid
errors due to limited precision. However, on some GPUs, especially consumer-grade ones, double
precision might not be as fast or supported as well. So maybe stick with float unless the user needs
extreme zoom levels.

I should also think about the number of iterations (max_iter). The higher it is, the more detailed
the image but at the cost of computation time. For a balance between quality and performance, 1000
iterations might be a good starting point.

Now, putting this all together into code:

- Host code: Initialize device, allocate host and device memory, set up parameters (centerX, centerY,
width), compute scaling factors to map pixel indices to complex coordinates.
- Kernel function: Each thread computes the iteration count for its assigned pixel.
- Transfer data back and write to image.

Wait, I might have missed the part where each thread in the kernel knows which pixel it’s responsible
for. So, in CUDA, each thread can calculate its global index using blockIdx and threadIdx variables.
For a 2D grid, we can compute a unique index per thread as:

int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

But wait, if the image is 2D, I need to flatten this into a single index for the array. So, perhaps
something like:

int pixelIndex = idy * width + idx;

Wait no, more accurately, each block has (blockDim.x, blockDim.y) threads, so the global thread
indices would be:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

Then, if the image is of size height x width, the pixel index could be computed as y * width + x,
assuming that each row has ‘width’ pixels and there are ‘height’ rows.

Wait, but in CUDA, it’s more common to have the kernel parameters as (int* output) where output is an
array of floats, with size height*width. So, for a 2D grid, each thread computes its x and y position
within the image, then maps that to the complex plane coordinates, computes iterations, and stores
the result in output[y * width + x].

Yes, that makes sense.

So, putting it all together:

Host code will set up the device, allocate memory for the output array (size height*width), define
kernel launch parameters (block size, grid size), then call the kernel. After the kernel finishes,
transfer data back to host, process into RGB values, and write to PPM.

Wait, but in the initial response, I didn’t handle the image writing part beyond computing the
iteration counts. Maybe it’s better to include that as well for completeness.

Also, when setting up the kernel parameters, I need to ensure that all pixels are covered by threads.
So if the image dimensions aren’t exact multiples of block size, some blocks will have fewer threads
in their last row or column. Alternatively, using a 1D mapping with each thread handling one pixel
might be simpler, but 2D can offer better memory coalescing.

Hmm, perhaps starting with a 1D approach where each thread handles a single pixel, and the grid is 1D
as well, but that might not be optimal. Alternatively, using a 2D block layout could improve memory
access patterns because adjacent threads in a row would access consecutive memory locations, which is
better for coalescing.

Wait, let me think again: If I have an image of size width x height, and each thread handles one
pixel, then the total number of threads per grid should be width * height. So arranging this into
blocks of (16x16) gives a manageable number. For example, if the image is 1024x768, then:

blockDim = dim3(16,16,1);
gridDim = dim3((width + 15)/16, (height + 15)/16, 1);

But wait, that might not be exactly correct because (width + 15) / 16 ensures rounding up.
Alternatively, in code, gridDim.x = (width + blockDim.x — 1) / blockDim.x; and similarly for y.

Yes, so the kernel would look something like:

__global__ void computeIterations(float* output, float centerX, float centerY, float scale, int
width, int height, int max_iter) {
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

if (x >= width || y >= height) return; // Guard against out-of-bounds

// Convert pixel coordinate to complex number
float c_re = centerX + (x - width/2) * scale;
float c_im = centerY + (y - height/2) * scale;

float z_re = 0.0f;
float z_im = 0.0f;
int iter = 0;

while (iter < max_iter && (z_re*z_re + z_im*z_im) <= 4.0f) {
// Compute next iteration
float temp = z_re * z_re - z_im * z_im + c_re;
z_im = 2 * z_re * z_im + c_im;

z_re = temp; // Wait, no: the correct calculation is:
// new_z_re = z_re² - z_im² + c_re
// new_z_im = 2*z_re*z_im + c_im
// So in code:

float next_z_re = z_re * z_re - z_im * z_im + c_re;
float next_z_im = 2 * z_re * z_im + c_im;

z_re = next_z_re;
z_im = next_z_im;

iter++;
}

// Store the result in output array
int idx = y * width + x; // Assuming row-major order, height rows each of width elements
output[idx] = (float)iter / max_iter; // Normalize to 0–1 range for easier image mapping later
}

Wait a second, I think I messed up the computation in the loop. The correct update is:

temp = z_re² - z_im² + c_re;
z_im_new = 2*z_re*z_im + c_im;

Then set z_re = temp and z_im = z_im_new.

So in code, it should be:

float next_z_re = z_re * z_re - z_im * z_im + c_re;
float next_z_im = 2 * z_re * z_im + c_im;
z_re = next_z_re;
z_im = next_z_im;

But wait, that’s not correct because using the updated z_im would affect subsequent calculations.
Wait no: in each iteration, we compute the next z based on current z. So the calculation is:

float temp_re = z_re * z_re - z_im * z_im + c_re;
float temp_im = 2 * z_re * z_im + c_im;

z_re = temp_re;
z_im = temp_im;

But wait, actually, in correct code, we need to compute both next_z_re and next_z_im before
updating z_re and z_im. So the above is correct.

In any case, each thread will loop up to max_iter times, which could be computationally intensive
for large images or high max_iter values, but that’s inherent to generating fractal images.

On the host side, after getting the output array with iteration counts normalized (e.g., 0.0f
being divergence at first step, 1.0f being maximum iterations without divergence), I can then map
these values into RGB colors.

For example:

- For each pixel value f in [0,1], map to grayscale by using (int)(f * 255) for each of R, G, B
components.
- Or apply a color palette for more interesting visuals, but that’s beyond the initial scope.

But perhaps including this part would make the response more complete. So after transferring data
back to host, I can process it into an image.

In code:

float* d_output;
cudaMalloc(&d_output, width * height * sizeof(float));

// Kernel launch
computeIterations<<<gridDim, blockDim>>>(d_output, centerX, centerY, scale, width, height,
max_iter);

// Transfer back to host
float* h_output = (float*)malloc(width * height * sizeof(float));
cudaMemcpy(h_output, d_output, width * height * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_output);

Then, create a PPM image.

Alternatively, since this is getting lengthy, perhaps the initial response only includes up to
computing iteration counts and not the image generation part, which might be acceptable.

Another consideration: The initial response uses 1D indexing for the output array. If I change it
to a 2D grid, does that affect anything? It’s more about how the data is stored in memory;
row-major order is standard, so storing as a 1D array where each row follows the previous is
correct.

Also, the computation of ‘scale’ needs to be accurate. For example, scale = zoom / (0.5 * width),
assuming that the window is scaled such that each pixel represents a certain step in the complex
plane.

Wait, perhaps I should compute scale as 4.0f / width * zoom; but actually, it’s better to compute
based on how much of the fractal we’re viewing. For example:

If we want the image to be centered at (centerX, centerY) and have a certain width in the complex
plane, then scale = width_complex / width_image.

But perhaps for simplicity, let me just define scale as 3.5f / width; which would mean that the
entire image covers from centerX — scale*(width/2) to centerX + scale*(width/2), similarly for
y-axis.

Alternatively, in code:

float scale = zoomFactor / (0.5 * width);

But maybe I should think about it more carefully. Suppose we have an image of width W and height
H. The complex plane is mapped such that the center is at (centerX, centerY). Each pixel’s
position corresponds to a point c in the complex plane.

The scale would determine how much each pixel represents in terms of the complex coordinates. For
example:

c_re = centerX + (x — W/2) * scale

Similarly for c_im. So scale determines how ‘zoomed’ the view is. A smaller scale means more
zoomed out, showing a larger area.

Wait no: actually, if we increase scale, each pixel represents a larger step in the complex plane,
which would mean that the image appears more zoomed in because you’re covering less of the
fractal’s extent but with higher detail per pixel.

Wait, perhaps I have it inverted. Let me think: If I set scale as 3.5 / width, then each pixel
represents a step of (3.5/width) units in both x and y directions. So if width is larger, the
scale becomes smaller, meaning that each pixel covers less of the complex plane, thus showing more
detail — so higher zoom.

Wait no: Let me think with numbers. Suppose width is 1000 pixels. If scale is 3.5 / 1000 = 0.0035
units per pixel. So from the center, moving 500 pixels left would be -0.0035 * 500 = -1.75, and
right 500 pixels +1.75.

So if I have a smaller width (e.g., 100 pixels), scale becomes larger: 3.5 / 100 = 0.035 per
pixel. So moving 50 pixels left would be -0.035 *50= -1.75 as well. Wait, same coverage but with
fewer pixels. Hmm.

Wait, perhaps the correct way is to have:

scale = (zoom) / width;

So that when you increase zoom, each pixel covers a smaller area (more zoomed in), or when width
increases, scale decreases (so more of the fractal is visible).

Alternatively, perhaps better to compute based on both width and height if aspect ratio matters.

But for simplicity, let’s assume we’re working with square images where width equals height. Then:

scale = 3.5f / width;

Thus, the image covers from centerX — (width*scale)/2 to centerX + (width*scale)/2 in the real
axis, similarly for imaginary.

This way, when width increases, scale decreases, so more of the fractal is visible — each pixel
represents a smaller step, hence less zoomed in.

But perhaps this can be adjusted based on specific needs. In any case, the exact computation may
vary depending on how the user wants to set up their view.

Putting it all together:

In the host code, we’d have something like:

int width = 800;
int height = 800;
float centerX = -0.5f;
float centerY = 0.0f;
float scale = 3.5f / width;

Then call the kernel with these parameters.

Now, considering all this, the initial response provided seems correct, but I might have missed
some steps or made errors in variable names or computations.

Another point: In CUDA kernels, using float precision is acceptable for most cases unless higher
precision is needed, which isn’t typically necessary for generating fractal images. So using
floats should be fine.

Potential optimizations:

1. Using shared memory to cache frequently accessed data, but in this case, each thread writes to
a unique output element, so shared memory may not provide much benefit.
2. Loop unrolling: Since the loop is fixed up to max_iter, and each iteration is simple arithmetic
operations, perhaps it’s already efficient enough without further optimizations.

Testing considerations:

- Ensure that for some known values of c (e.g., c=0), the correct number of iterations are
computed before reaching the bailout radius.
- Test edge cases like when c is at the center or towards the edges to ensure scale and
positioning are accurate.
- Verify that changing centerX, centerY, and scale correctly shifts the viewed portion of the
fractal.

Possible extensions:

- Implementing different color maps or algorithms for determining pixel colors based on iteration
counts.
- Adding parameters to control the maximum number of iterations (depth) and bailout radius.
- Allowing for interactive zooming and panning by adjusting centerX, centerY, and scale
dynamically in response to user input.
- Experimenting with different fractal types like Julia sets or other escape-time fractals.

In conclusion, yes, we can create a CUDA kernel to compute the Mandelbrot set. The process
involves mapping each pixel to a complex number c, iterating the quadratic function f(z) = z² +
c, and determining how quickly |z| exceeds 2 (the bailout radius). Based on the iteration count,
assign colors or shades to each pixel to form the fractal image.
</think>

To compute the Mandelbrot set using CUDA, follow these steps:

1. **Set Parameters**: Define the dimensions of your output image (width and height), choose a
center point (centerX, centerY) in the complex plane, and determine the scale which controls the
zoom level.

2. **Initialize Output Array**: Create an array to store the iteration counts for each pixel.

3. **CUDA Kernel**: Write a kernel that maps each thread to a pixel, computes the corresponding
complex number c, iterates the quadratic function f(z) = z² + c, and records how many iterations
it takes for |z| to exceed 2 (bailout radius).

4. **Memory Management**: Allocate memory on both host and device, copy necessary data to the GPU.

5. **Launch Kernel**: Invoke the kernel with appropriate grid and block dimensions.

6. **Post-Processing**: After computation, transfer results back to the host and generate an image
based on iteration counts.

Here’s a sample CUDA code implementing these steps:

```c
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <iostream>

// Define constants
const int width = 800;
const int height = 800;
const float centerX = -0.5f;
const float centerY = 0.0f;
const float scale = 3.5f / width;

__global__
void mandelbrotKernel(unsigned char* d_output) {
    // Calculate global thread ID
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= width || y >= height) return;

    // Convert pixel coordinate to complex number c
    float cReal = centerX + (x - width / 2) * scale;
    float cImag = centerY + (y - height / 2) * scale;

    float zReal = 0.0f;
    float zImag = 0.0f;
    unsigned char iteration = 0;
    const unsigned char maxIter = 255; // Maximum iterations

    while (iteration < maxIter && (zReal*zReal + zImag*zImag) <= 4.0f) {
        float temp = zReal * zReal - zImag * zImag + cReal;
        zImag = 2.0f * zReal * zImag + cImag;
        zReal = temp;

        iteration++;
    }

    // Store the result
    d_output[y * width + x] = iteration;
}

int main() {
    // Allocate memory on device
    unsigned char* d_output;
    cudaMalloc((void**)&d_output, width * height);

    // Kernel launch parameters
    int blockSize = 16;
    dim3 block(blockSize, blockSize);
    dim3 grid((width + blockSize - 1) / blockSize, (height + blockSize - 1) / blockSize);

    // Launch the kernel
    mandelbrotKernel<<<grid, block>>>(d_output);
    cudaDeviceSynchronize();

    // Allocate host memory and copy results back
    unsigned char* h_output = new unsigned char[width * height];
    cudaMemcpy(h_output, d_output, width * height, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_output);

    // Process output data (e.g., save to an image)
    // Replace this with your image generation code
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int idx = y * width + x;
            std::cout << "Pixel (" << x << "," << y << ") Iteration: "
                      << static_cast<int>(h_output[idx]) << std::endl;
        }
    }

    delete[] h_output;

    return 0;
}
```

**Explanation**:

- **Parameters**: Adjust `width`, `height`, `centerX`, `centerY`, and `scale` to zoom into
different parts of the Mandelbrot set.

- **Kernel Function**: Each thread calculates the complex number c for its corresponding pixel,
iterates the function z = z² + c, and counts iterations until |z| > 2 or maxIter is reached.

- **Memory Management**: Allocates memory on both host and device, copies results after
computation.

- **Image Generation**: The code includes placeholders where you can add functionality to convert
iteration counts into an image format like ppm or bmp.

To compile and run this code, ensure CUDA toolkit is installed. Use a command like:

```bash
nvcc -o mandelbrot mandelbrot.cu
./mandelbrot
```

This will compute the Mandelbrot set and display iteration counts for each pixel. For
visualization, integrate with an image library or framework to generate an actual image from
`h_output` data.

total duration: 6m16.4906879s
load duration: 11.4388ms
prompt eval count: 68 token(s)
prompt eval duration: 572ms
prompt eval rate: 118.88 tokens/s
eval count: 5278 token(s)
eval duration: 6m15.906s
eval rate: 14.04 tokens/s

Prompt 2: deeper ask

(no code this time)

>>> Can we go over creating a mandelbrot image set using CUDA C — the 12.6 version for compute capabilit
… y 8.6 to 8.9. I would like to fully utilize all the SM cores of the GPU. I am using specific GPUs fr
… om NVIDIA including the RTX-3500 Ada mobile with 5120 cores, the RTX-A6000, a dual RTX-A4500 with NV
… Link and a dual RTX-4090 with 32768 cores spread across both GPUs. I would like the FP precision to
… be as high as possible to allow for iteration of image generation down to very low levels of magnitu
… de. If possible suggest optimizations for each specific card and compute capability so that the opti
… mum FP precision is used whether FP16 or FP64. The iteration level should be customizable to be able
… to handle up to 24bit color precision. Suggest a method we can use to allow for persistence of the
… resulting images or raster data without impacting the image generation pipelining and throttling the
… GPU saturation — by minimizing file system writes or network based database storage. For reference
… the CPU bus is PCIe x16 for single GPUs and PCI x8 for dual gpu setups. The storage system is 1 to 5
… GB/s NVMe drives. The primary C++ IDE is Visual Studio 2022.
<think>
Okay, so I’m trying to create a Mandelbrot image set using CUDA C version 12.6 targeting compute
capability 8.6 to 8.9. My goal is to fully utilize all the SM cores on my GPUs, which include some
pretty powerful ones like the RTX 3500 Ada mobile with 5120 cores, RTX A6000, dual RTX A4500 with
NVLink, and even a dual RTX 4090 with 32768 cores. I want the highest possible FP precision so I
can iterate down to very low magnitudes in my images. I also need this setup to handle up to
24-bit color precision. Plus, I don’t want to slow down the GPU by writing files too often; I need
a way to store the raster data without impacting performance.

First, I should figure out how to set up CUDA with these specific GPUs. Since they’re all based on
compute capability 8.x, using CUDA 12.6 makes sense because it supports those architectures. I’ll
probably start by setting up my project in Visual Studio 2022 with the CUDA toolkit installed. I
remember that each GPU has different numbers of SMs and cores per SM, so optimizing for each is
key.

I know that FP64 precision requires more resources than FP32 or FP16, but since I need high
precision for deep zoom levels, maybe using FP64 is necessary. However, some of my GPUs might not
have as much support for FP64 operations. For example, the RTX 4090 has better Tensor Core support
which might help with FP16 and mixed-precision computations. So perhaps a hybrid approach where I
use FP32 for most calculations but switch to FP64 when needed.

The Mandelbrot set calculation involves iterating the function f(z) = z² + c until it diverges or
reaches max iterations. To parallelize this, each pixel can be handled by a CUDA thread. So I’ll
need a kernel that takes in parameters like image dimensions, center coordinates, zoom level, and
max iterations.

I should think about memory management. Using device arrays to store the iteration counts makes
sense because it avoids host-device transfers which are slow. But I have to allocate enough memory
for each pixel’s data. Since my images could be large (like 16k x 16k), I need to ensure I’m not
running out of VRAM, especially when dealing with FP64 which takes more space.

Coalesced memory access is important for performance. If I organize the pixels in a way that
threads can access contiguous blocks efficiently, it should speed things up. Maybe using a 2D grid
where each block handles a tile of pixels could help minimize global memory accesses and improve
cache utilization.

Occupancy is another factor. I want as many active warps per SM as possible to hide latency. For
FP64-heavy kernels, maybe reducing the number of threads per block can help because FP64 uses more
registers. Alternatively, using mixed precision might allow for better occupancy since Tensor
Cores on newer GPUs handle FP16 efficiently.

I also need a way to visualize or store these images without slowing down the GPU. Writing each
frame directly to disk as it’s generated could be too slow, especially with multiple GPUs.
Instead, maybe I can use a double buffering approach where one buffer is being written while the
other is processed. Alternatively, using a producer-consumer queue model where the GPU generates
frames into buffers and a separate thread handles writing them to storage when ready.

For dual-GPU setups like the RTX A4500 with NVLink or dual RTX 4090, I should consider whether to
split the workload across both GPUs or have each handle its own image. Peer-to-peer memory access
over NVLink is faster than going through the CPU, so if I’m generating images in parallel on each
GPU, that might be more efficient.

Optimizing for each specific card: RTX 3500 Ada mobile has fewer cores but high clock speeds, so
maybe it’s better suited for FP32 tasks. The RTX A6000 and dual A4500 have more cores and NVLink,
which is good for split workloads or higher precision tasks. Dual RTX 4090s with lots of Tensor
Cores might benefit from using mixed-precision computations with FP16 for speed.

Testing different data types on each GPU to see where performance breaks down would be important.
Maybe start with FP32 and then switch to FP64 only when necessary, but if the loss of precision
isn’t noticeable at lower zoom levels, that could save resources.

I also need to make sure I’m using the right libraries for image handling. Since 24-bit color
requires storing each pixel as three bytes (RGB), perhaps using a struct or a packed format would
help manage memory efficiently on the GPU and during transfer.

In terms of storage, since my NVMe drives are up to 5 GB/s, which is fast but not infinitely so, I
need to batch writes. Maybe accumulate several frames in device memory before transferring them
over the PCIe bus and writing to disk. Using asynchronous memory transfers with CUDA streams could
help here so that the GPU isn’t waiting for the data to be written.

I also wonder about using compression. If the raster data is compressed on the GPU before
transfer, it might reduce the amount of data being sent over the bus or stored. But that would add
computational overhead which might negate the benefits unless done efficiently in hardware.

For handling multiple GPUs, I should use CUDA’s multi-GPU functions. Maybe assign each GPU a
portion of the image or have them work on different images in parallel if generating a set.
Synchronizing access to data structures is important to avoid race conditions, so using atomic
operations where necessary and proper memory fences would help.

Testing with smaller images first to see how the code performs before scaling up to 16k x 16k
might be a good idea. Profiling each part of the code (kernel execution time vs. memory transfer
time) will show where the bottlenecks are and guide optimization efforts.

Finally, I need to implement this step by step: set up the CUDA project, write the kernel, manage
memory, handle multiple GPUs, optimize for each GPU’s strengths, and then figure out the storage
strategy without slowing down the GPU. Testing each part incrementally will help catch issues
early on.
</think>

Creating a Mandelbrot image set using CUDA C 12.6 with high FP precision across various NVIDIA
GPUs involves several key steps and optimizations. Here’s an organized approach to achieve this
efficiently:

### 1. **CUDA Setup and Project Configuration**
— **Environment**: Use Visual Studio 2022 with the CUDA toolkit installed.
— **Compute Capability**: Target compute capability 8.x using CUDA 12.6 for compatibility
across all specified GPUs.

### 2. **GPU Architecture Optimization**
— **RTX 3500 Ada Mobile**: Utilize FP32 due to high clock speeds and fewer cores, optimizing
for single-precision tasks.
— **RTX A6000 & Dual RTX A4500 with NVLink**: Leverage more cores and NVLink for higher
precision tasks, potentially using FP64 where needed.
— **Dual RTX 4090**: Exploit Tensor Cores with mixed-precision (FP16/FP32) computations for
speed while maintaining sufficient accuracy.

### 3. **Mandelbrot Kernel Implementation**
— **Parallelization**: Assign each pixel to a CUDA thread, using 2D grids and blocks for
efficient memory access.
— **Memory Management**: Use device arrays for iteration counts to minimize host-device
transfers. Allocate VRAM efficiently for large images (e.g., 16k x 16k).

### 4. **Precision Handling**
— **Hybrid Approach**: Use FP32 for most calculations and switch to FP64 for deep zoom levels
requiring high precision.
— **Mixed Precision**: On Tensor Core-equipped GPUs, use FP16 for speed with fallback to higher
precision where necessary.

### 5. **Optimization Techniques**
— **Coalesced Memory Access**: Organize pixels in contiguous blocks to enhance global memory
access efficiency.
— **Occupancy Management**: Adjust threads per block to balance register usage and active
warps, especially for FP64 operations.
— **Async Memory Transfers**: Use CUDA streams for asynchronous data transfer, reducing GPU
idle time.

### 6. **Multi-GPU Handling**
— **Workload Distribution**: Split tasks across GPUs using CUDA’s multi-GPU functions. Consider
peer-to-peer memory access over NVLink for faster data transfer between dual GPUs.
— **Synchronization**: Use atomic operations and memory fences to prevent race conditions in
shared data structures.

### 7. **Storage Strategy**
— **Batch Processing**: Accumulate multiple frames in device memory before transferring to
host, minimizing PCIe bus usage.
— **Compression Consideration**: Evaluate hardware-accelerated compression on the GPU to reduce
data transfer volume without significant overhead.

### 8. **Image Handling and Profiling**
— **24-bit Color Storage**: Use structs or packed formats for RGB pixels to manage memory
efficiently.
— **Performance Profiling**: Test with smaller images first, profiling kernel execution vs.
memory transfer times to identify bottlenecks.

### 9. **Incremental Testing**
— **Step-by-Step Development**: Implement and test each component (kernel, memory management,
multi-GPU handling) incrementally to isolate issues early.

By following this structured approach, you can efficiently generate high-precision Mandelbrot
images across various GPU architectures while optimizing performance and storage efficiency.

total duration: 2m28.5243544s
load duration: 8.8316381s
prompt eval count: 273 token(s)
prompt eval duration: 605ms
prompt eval rate: 451.24 tokens/s
eval count: 1964 token(s)
eval duration: 2m18.84s
eval rate: 14.15 tokens/s

Google Gemini Advanced

Prompt 1: shallow ask

Prompt 2: deeper ask

Security

Conclusions:

DeepSeek has created a reasoning model that approaches the OpenAI o1 model. The key point is that this model is open source and available for local, secure, air-gapped deployment.

Depending on the prompt, DeepSeek R1 sometimes goes into a very deep, repetitive introspection before summarizing the resulting code, reworking itself along the way, which is interesting. The fact that the model corrects itself and verbosely describes what it is doing is nice; it is almost as if the model had also been trained on FAANG LeetCode interviewing techniques. OpenAI o1 pro generates an introspection plan upfront, or at least we are shown the first iteration of the plan.

On other occasions, with longer prompts, deepseek-r1 may not generate actual code at all. In general, the key is how well the developer/architect/DevOps engineer communicates intent to the statistical, vector-based LLM.

I spent the weekend before the temporary stock market dip evaluating the DeepSeek R1 70b model. I was planning on including it in a more general performance optimization story. Based on the events of today, I think a preliminary article on quickly running the DeepSeek model offline, without using the web or mobile app, would be helpful.

The next article will be on low-level, architecture-aware optimization, starting with the following…

/michael

For developers

Last Computer History Museum visit 2024
