Code:
#!/usr/bin/env bash
# Script to run Qwen3.6-35B-A3B MTP model optimized for RTX 4070 (8GB VRAM) and i9-13980HX (24-core CPU, 96GB RAM)
# Path to the compiled CUDA llama-server binary
SERVER_BIN="/path/to/llama.cpp/bin/CUDA/llama-server"
# Model configuration
MODEL_PATH="/path/to/models/Qwen3.6-35B-A3B-MTP/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf"
MMPROJ_PATH="/path/to/models/Qwen3.6-35B-A3B-MTP/mmproj-F16.gguf"
# Hardware optimizations
# Set GPU_LAYERS to 99 (all layers) but use --cpu-moe and --cpu-moe-draft to force the massive
# MoE expert weights to reside in system RAM (96GB), keeping the hot path (Attention, KV Cache,
# Emb/Out heads, MTP dense layers) in high-speed GPU VRAM (8GB) without causing OOM.
GPU_LAYERS=99
# Thread setup for i9-13980HX (8 Performance cores, 16 Efficient cores).
# Heavy matrix computation (LLM inference) runs best when the thread count matches the number of physical Performance cores (8).
THREADS=8
THREADS_BATCH=16 # Batch processing can benefit from more threads
# Context and KV cache configuration
# A smaller context (e.g., 32768) keeps KV cache overhead down; CTX_SIZE below is set much larger (163840).
# --cache-type-k q8_0 and --cache-type-v q8_0 quantize the KV cache to 8-bit to save memory and memory bandwidth.
CTX_SIZE=163840
# Run llama-server pinned to the Intel P-cores (logical CPUs 0-15, i.e. 8 P-cores with Hyper-Threading) for maximum performance
exec taskset -c 0-15 "$SERVER_BIN" \
    --model "$MODEL_PATH" \
    --mmproj "$MMPROJ_PATH" \
    --n-gpu-layers "$GPU_LAYERS" \
    --cpu-moe \
    --cpu-moe-draft \
    --parallel 1 \
    --ctx-size "$CTX_SIZE" \
    --threads "$THREADS" \
    --threads-batch "$THREADS_BATCH" \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --flash-attn on \
    --no-mmap \
    --seed 3407 \
    --port 8080 \
    --host 0.0.0.0 \
    --temp 0.6 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --jinja

Using three.js make the scene of the moon landing. Show the lunar lander and the US flag being planted by Neil Armstrong.
And another quick test.
Using C++ and Qt make a calculator that uses commas as a thousands separator. I have a hard time telling if I typed 15 million or 1.5 million.
Please let me know if there are any tricks to get this to run even faster. I think with time local AI will be useful for coding tasks and other high-end workloads. I just hope we end up in a world where companies sell graphics cards with large amounts of VRAM, and where renting GPUs or paying for a service becomes a thing of the past.
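On the `taskset -c 0-15` pinning in the script: the P-core range could be looked up rather than hardcoded, which may help if the script is moved to a machine with a different core layout. This is only a minimal sketch. It assumes a Linux kernel that exposes `/sys/devices/cpu_core/cpus` for Intel hybrid CPUs, and `pcore_cpus` is a hypothetical helper name, not part of llama.cpp. It falls back to `0-15` when the file is missing or empty.

```shell
#!/usr/bin/env bash
# Sketch: look up the P-core logical CPU list instead of hardcoding "0-15".
# Assumption: the kernel exposes /sys/devices/cpu_core/cpus (Intel hybrid CPUs).

# pcore_cpus [SYSFS_FILE] [FALLBACK]   (hypothetical helper)
# Prints the CPU list found in SYSFS_FILE, or FALLBACK if the file is
# unreadable or empty.
pcore_cpus() {
    local sysfs_file="${1:-/sys/devices/cpu_core/cpus}"
    local fallback="${2:-0-15}"
    local cpus=""
    if [ -r "$sysfs_file" ]; then
        # Strip the trailing newline and any stray whitespace.
        cpus="$(tr -d '[:space:]' < "$sysfs_file")"
    fi
    printf '%s\n' "${cpus:-$fallback}"
}

PCORE_CPUS="$(pcore_cpus)"
echo "Would pin llama-server to CPUs: $PCORE_CPUS"
# In the launch script this would replace the hardcoded range:
#   exec taskset -c "$PCORE_CPUS" "$SERVER_BIN" ...
```

Whether pinning to all 16 P-core hardware threads or only one thread per P-core is faster is something to measure on the actual machine; the sketch only removes the hardcoded range.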