
Install CUDA to Enable GPU Acceleration for LLM Apps

After downloading LLM Apps with LM Downloader, many users run into the same issue: despite high-end hardware, the software runs surprisingly slowly, with the CPU maxed out while the powerful GPU sits mostly idle. If you do have a high-performance dedicated GPU, the problem is usually an improperly installed or configured acceleration stack.

For example, when running ComfyUI for image/video generation with complex prompts, you might endure long wait times. The task manager shows the CPU working overtime, while the GPU sits there 'twiddling its thumbs.'

This happens because such software typically requires massive computational workloads. While CPUs are versatile, they're far less efficient than GPUs at handling these large-scale parallel computing tasks. Without proper configuration of acceleration tools, the GPU's potential goes untapped—resulting in sluggish performance.

Currently, NVIDIA GPUs remain the mainstream choice for large model training and inference acceleration. Since its launch in 2006, CUDA has matured through years of iteration into a tightly integrated ecosystem spanning hardware, software, and developer tooling.

This article primarily focuses on installing CUDA for NVIDIA graphics cards in Windows systems to accelerate large model software. It's worth noting that running AMD's ROCm on Windows requires WSL (Windows Subsystem for Linux), resulting in a significantly different installation process compared to CUDA. We will provide detailed documentation on ROCm setup in a separate guide.

Many AI applications (such as ComfyUI, LLaMA, Stable Diffusion, and Spark-TTS) are built on PyTorch. The ecosystem works much like Android: because PyTorch is so widely adopted, software supports it by default. When you use AI for image generation or chat, the software relies on PyTorch to drive your GPU (which, for NVIDIA cards, requires CUDA), making everything significantly faster and smoother.

However, if your computer lacks an NVIDIA GPU (or hasn't installed CUDA), the software may fall back to CPU-only operation, which is substantially slower.
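
To see which device a PyTorch-based app will actually use, here is a minimal sketch (assuming PyTorch is installed in your Python environment):

import torch

# True only when an NVIDIA GPU and a CUDA-enabled PyTorch build are both present
print("CUDA available:", torch.cuda.is_available())

# Pick the device the way most PyTorch apps do internally
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))

x = torch.randn(4, 4).to(device)  # tensors must be moved to the chosen device
print("Tensor lives on:", x.device)

If this prints "CUDA available: False" on a machine with an NVIDIA GPU, the driver or CUDA setup described below is the likely culprit.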

Introduction to Chip Manufacturers' Tools

NVIDIA CUDA

Introduced in 2006, CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary parallel computing platform designed to harness GPU acceleration for complex computations. It provides a comprehensive suite of development tools and libraries, enabling developers to efficiently leverage NVIDIA GPUs for high-performance computing. With a mature ecosystem, CUDA is widely adopted in academia and industry, supporting major deep learning frameworks and scientific computing libraries. However, it is exclusive to NVIDIA GPUs.

AMD ROCm

Launched in 2015, ROCm (Radeon Open Compute) is AMD’s open-source alternative to CUDA, targeting high-performance computing (HPC) and large-scale GPU acceleration. It includes developer tools, software frameworks, libraries, compilers, and programming models. While primarily optimized for AMD GPUs, ROCm is gradually expanding support for other hardware vendors. Though relatively newer, its ecosystem is growing rapidly and already supports multiple deep learning frameworks.

Intel’s AI Tools (OpenVINO™, oneAPI, IPEX)

Intel’s Core Ultra processors (e.g., Core Ultra 200 series) deliver robust AI capabilities, offering up to 120 TOPS of compute power—sufficient for locally deployed large-scale AI models. As the leader in x86 processors, Intel prioritizes compatibility across Windows/Linux and frameworks like PyTorch/TensorFlow, adopting an open-standard, modular toolset approach:

  • OpenVINO™: Intel's open-source toolkit for optimizing and deploying deep learning inference; its plugin architecture targets Intel CPUs, GPUs, and NPUs, with ARM CPU support as well (a quick device check follows this list).
  • oneAPI: An open, unified programming model (e.g., SYCL) for heterogeneous computing (CPU/GPU/FPGA).
  • IPEX & Neural Compressor: Framework extensions optimized for PyTorch/TensorFlow’s native interfaces.
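
If you're curious which devices Intel's OpenVINO runtime can see on your machine, a minimal Python check (assuming the openvino package is installed via pip) looks like this:

from openvino import Core  # pip install openvino

core = Core()
# List the inference devices OpenVINO detects on this machine, e.g. ['CPU', 'GPU', 'NPU']
print(core.available_devices)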

Principles of Inference Acceleration

CUDA

NVIDIA GPUs feature a massive number of cores capable of processing multiple computational tasks simultaneously. CUDA accelerates performance by breaking down tasks into smaller subtasks and distributing them across GPU cores for parallel execution. For instance, during the training and inference of large AI models—which involve extensive matrix operations—CUDA leverages the GPU’s parallel architecture to complete these computations far faster than a CPU’s sequential processing.
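
To see this gap for yourself, here is a rough, illustrative timing sketch in PyTorch (assuming a CUDA-enabled PyTorch build; the exact numbers depend entirely on your hardware):

import time
import torch

n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

# CPU baseline: one large matrix multiplication
t0 = time.perf_counter()
a @ b
cpu_s = time.perf_counter() - t0

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()  # wait for the host-to-GPU copies to finish
    t0 = time.perf_counter()
    a_gpu @ b_gpu
    torch.cuda.synchronize()  # GPU kernels launch asynchronously
    gpu_s = time.perf_counter() - t0
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")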

ROCm

Similarly, ROCm harnesses the parallel computing power of AMD GPUs. It employs the HIP (Heterogeneous-compute Interface for Portability) programming model to distribute tasks across AMD GPU cores for concurrent processing. Additionally, ROCm includes highly optimized libraries like rocBLAS and rocFFT, tailored for machine learning and high-performance computing (HPC) workloads. These libraries maximize AMD GPU efficiency, significantly speeding up large AI model operations.
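
A practical note for PyTorch users: ROCm builds of PyTorch reuse the familiar torch.cuda API, so the same availability check works on AMD GPUs. A minimal sketch (assuming a ROCm build of PyTorch):

import torch

# torch.version.hip is set on ROCm builds and None on CUDA-only builds
print("HIP runtime:", torch.version.hip)
# On ROCm builds this returns True for a supported AMD GPU
print("GPU visible:", torch.cuda.is_available())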

Checking NVIDIA Graphics Card Information

Via Display Settings

Right-click on an empty area of your desktop and select "Display settings", then scroll down and click "Advanced display settings". Here, you can view the monitors connected to your system and the corresponding graphics adapter information.

Via Device Manager

Press Win + X to open the system menu, then select "Device Manager". Locate and expand the "Display adapters" section; the listed graphics card name helps you determine whether it's a dedicated GPU. Dedicated graphics cards typically have specific model names, often including the brand and series. (A scriptable way to list the same adapters appears after the examples below.)

Example Scenarios:

  • A system whose display adapter list shows only an Intel integrated graphics card.
  • A system listing both an AMD integrated graphics card and an NVIDIA GeForce RTX 5060 Ti dedicated GPU.
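
If you prefer a scriptable alternative to the GUI, the sketch below queries the same adapter list from Python by calling PowerShell (assuming you run it on Windows):

import subprocess

# Ask Windows for the installed display adapters (the same data Device Manager shows)
cmd = ["powershell", "-NoProfile", "-Command",
       "Get-CimInstance Win32_VideoController | Select-Object -ExpandProperty Name"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)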

Downloading and Installing Drivers

Go to the NVIDIA driver download page (https://www.nvidia.com/Download/index.aspx) to download the drivers. It's recommended to use the latest version whenever possible. If immediate support for new games isn't a priority, choose the "Studio Drivers" for greater stability and reliability.

Select your region and language on the download page as appropriate.

Installation Tip:
If you have no specific requirements, choose the "Express" installation option for a hassle-free setup.

Checking GPU CUDA Compatibility

Verifying CUDA Support and Version

For NVIDIA graphics cards, there are two methods to check the compatible CUDA version:

Method 1: NVIDIA Control Panel

  1. Open NVIDIA Control Panel
  2. Click "System Information"
  3. Navigate to the "Components" tab
  4. Check the maximum CUDA version supported by your current driver

Note: If you see "NVIDIA CUDA 12.9.76 driver", this indicates support for CUDA version 12.9.

Method 2: Command Line

  1. Open Command Prompt or PowerShell
  2. Enter the command: nvidia-smi
  3. Locate the "CUDA Version" field in the output

Key information: "CUDA Version: 12.9" means your GPU supports CUDA 12.9.

Important:
The supported CUDA version may vary between systems. Please verify based on your specific hardware configuration.
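
If you'd rather script this check, here is a small sketch that runs nvidia-smi and extracts the reported CUDA version (assuming the NVIDIA driver is installed and nvidia-smi is on your PATH):

import re
import subprocess

# nvidia-smi prints a header such as "... CUDA Version: 12.9 ..."
out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
match = re.search(r"CUDA Version:\s*([\d.]+)", out)
print("Driver supports CUDA", match.group(1) if match else "unknown")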

Downloading and Installing CUDA

1. Downloading CUDA

  1. Visit the official CUDA download page:
    https://developer.nvidia.com/cuda-downloads
  2. Select the appropriate CUDA version based on:

    • Your GPU's supported CUDA version
    • Your operating system
  3. For example, Windows 11 users should:

    • Select: Windows → x86_64 → 11 → exe (local)
    • Note:
      • exe (local): complete package (~3.31 GB), recommended for offline installation
      • exe (network): small installer (~13.9 MB), requires an internet connection during installation

2. Installing CUDA

  1. Run the downloaded EXE installer
  2. Wait for:

    • File extraction to complete
    • System compatibility check to finish
  3. Accept the license agreement ("Agree and Continue")
  4. Choose installation type:

    • Express Installation: recommended for most users (default settings)
    • Custom Installation: for advanced users (can deselect unnecessary components)
  5. Complete the installation and restart your computer if prompted

What to Do When This Option Appears

If you see the "CUDA Visual Studio Integration" option during installation:

  1. For most users (no Visual Studio installed):

    • Simply check the box and click "Next"
    • Note: this prompt appears when Visual Studio isn't detected; most AI/ML applications work fine without the integration
  2. For developers (optional): if you plan to do CUDA programming in Visual Studio:

    1. Ensure Visual Studio 2019/2022 is installed first
    2. Then re-run the CUDA installer
    3. Select this component

Verifying the Installation

Open Command Prompt and enter "nvcc --version". If CUDA version information is displayed, the installation was successful.

C:\Users\LMD>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Apr__9_19:29:17_Pacific_Daylight_Time_2025
Cuda compilation tools, release 12.9, V12.9.41
Build cuda_12.9.r12.9/compiler.35813241_0
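
To confirm in one pass that both the toolkit and your PyTorch-based apps can use the GPU, here is a combined check (a sketch assuming Python and PyTorch are installed):

import subprocess
import torch

# 1) Toolkit: is nvcc installed and reporting a version?
try:
    nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    print(nvcc.stdout.splitlines()[-1])  # e.g. "Build cuda_12.9.r12.9/..."
except FileNotFoundError:
    print("nvcc not found; is CUDA's bin directory on your PATH?")

# 2) Framework: was this PyTorch built with CUDA, and can it see the GPU?
print("torch built for CUDA:", torch.version.cuda)
print("GPU usable:", torch.cuda.is_available())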

If previously installed apps still fail to use GPU acceleration, reinstall them with LM Downloader. Reinstalling will not delete your data or model files; deleting an app, however, usually removes its models and related data files.

If you still encounter issues, please contact our technical support team at tech@daiyl.com.