# Configuration
## AgentConfig

The `AgentConfig` struct is created internally when you call `Agent::make()`. Defaults are sensible for most use cases:
```rust
pub struct AgentConfig {
    pub system_prompt: String,               // default: ""
    pub template: ChatTemplate,              // default: Chatml
    pub max_iterations: usize,               // default: 10
    pub eviction_strategy: EvictionStrategy, // default: 8K tokens
}
```

You control these via builder methods, not by constructing `AgentConfig` directly.
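In practice that looks like the pattern below. This is a sketch only: `with_eviction_strategy` appears later on this page, while `with_system_prompt` is a hypothetical method name used purely for illustration.

```rust
// Sketch only: with_system_prompt is a hypothetical method name;
// with_eviction_strategy is shown further down this page.
let agent = Agent::make(engine_config).await?
    .with_system_prompt("You are a terse assistant.")
    .with_eviction_strategy(EvictionStrategy { max_safe_tokens: 8000 });
```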
## EvictionStrategy
Controls when and how old messages are removed from context:
```rust
pub struct EvictionStrategy {
    pub max_safe_tokens: usize, // default: 8000
}
```

When `total_tokens + prompt_overhead > max_safe_tokens`, the framework pops the oldest messages (FIFO) until the budget fits.
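As a rough illustration of that FIFO policy (not the framework's internal code; `evict` and the crude `count_tokens` heuristic here are stand-ins for the real tokenizer-based accounting):

```rust
use std::collections::VecDeque;

// Illustrative sketch of FIFO eviction – not the crate's actual implementation.
fn evict(messages: &mut VecDeque<String>, prompt_overhead: usize, max_safe_tokens: usize) {
    // Crude token estimate for the sketch; the framework counts real tokens.
    let count_tokens = |m: &String| m.len() / 4;

    let mut total: usize = messages.iter().map(|m| count_tokens(m)).sum();
    while total + prompt_overhead > max_safe_tokens {
        match messages.pop_front() {
            Some(oldest) => total -= count_tokens(&oldest), // drop the oldest message
            None => break,                                  // nothing left to evict
        }
    }
}
```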
```rust
let agent = Agent::make(config).await?
    .with_eviction_strategy(EvictionStrategy { max_safe_tokens: 4096 });
```

The default of 8K is a rough safe point for 8K-context models. For 128K models you might set it to 64K or higher. The exact value depends on how much output room you need.
## LLMEngineConfig

This is the enum you pass to `Agent::make()`:
```rust
pub enum LLMEngineConfig {
    #[cfg(feature = "openai-api")]
    OpenAI(OpenAIEngineConfig),
    #[cfg(feature = "llama-cpp")]
    Llama(LlamaEngineConfig),
    Custom(Box<dyn LLMEngineTrait>),
}
```

### OpenAI config
```rust
OpenAIEngineConfig {
    api_key: String,
    base_url: String,   // "https://api.openai.com/v1"
    model_name: String, // "gpt-4o"
    temp: f32,          // 0.0 – 2.0
    top_p: f32,         // 0.0 – 1.0
}
```

`base_url` can point to any OpenAI-compatible endpoint (DeepSeek, Ollama with OpenAI adapter, etc.).
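For example, a sketch of wiring this up against a local OpenAI-compatible server. The values are placeholders, and plain struct-literal construction is an assumption; the crate may instead expose a builder or `Default` impl.

```rust
// Sketch – assumes OpenAIEngineConfig is built as a plain struct literal.
let engine = LLMEngineConfig::OpenAI(OpenAIEngineConfig {
    api_key: std::env::var("OPENAI_API_KEY")?,
    base_url: "http://localhost:11434/v1".to_string(), // e.g. Ollama's OpenAI adapter
    model_name: "gpt-4o".to_string(),
    temp: 0.7,
    top_p: 0.9,
});
let agent = Agent::make(engine).await?;
```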
### Llama.cpp config
```rust
LlamaEngineConfig {
    model_path: String,          // path to .gguf file
    mmproj_path: Option<String>, // external vision projector (e.g., mmproj-model-f16.gguf)
    integrated_vision: bool,     // whether the model has native vision capabilities
    max_tokens: i32,             // max tokens to predict
    buffer_size: usize,          // batch buffer size for piece decoding
    use_gpu: bool,               // offload layers to GPU
    n_gpu_layers: u32,           // how many layers to offload to GPU
    n_ctx: u32,                  // context window size
    n_tokens: usize,             // batch size for prompt processing
    n_seq_max: i32,              // max sequences in a batch
    penalty_last_n: i32,         // past tokens to consider for penalties
    penalty_repeat: f32,         // repetition penalty
    penalty_freq: f32,           // frequency penalty
    penalty_present: f32,        // presence penalty
    temp: f32,                   // temperature (0.0 – 2.0)
    top_p: f32,                  // nucleus sampling threshold
    seed: u32,                   // RNG seed for deterministic generation
    min_keep: usize,             // min-keep sampling boundary
}
```

Validation runs at load time: if required fields are missing or out of range, you get an `EngineError` immediately rather than a cryptic crash mid-inference.
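A sketch of a typical local setup follows. The values are illustrative rather than recommendations, and, as above, struct-literal construction plus a `Default` impl for the fields not shown are assumptions.

```rust
// Sketch – assumes LlamaEngineConfig has a Default impl for the fields not shown.
let engine = LLMEngineConfig::Llama(LlamaEngineConfig {
    model_path: "models/llama-3.1-8b-instruct-q4_k_m.gguf".to_string(),
    use_gpu: true,
    n_gpu_layers: 32, // tune for your VRAM
    n_ctx: 8192,      // keep in line with the eviction budget above
    max_tokens: 1024,
    temp: 0.7,
    top_p: 0.9,
    ..Default::default()
});
let agent = Agent::make(engine).await?;
```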
## Feature flags
```toml
[dependencies]
ambi = { version = "0.3", default-features = false, features = ["openai-api"] }
```

| Feature | What it enables | Dependencies |
|---|---|---|
| `openai-api` | OpenAI-compatible cloud backend | `async-openai` |
| `llama-cpp` | Local inference via llama.cpp | `llama-cpp-2`, `llama-cpp-sys-2` |
| `cuda` | CUDA acceleration (implies `llama-cpp`) | + CUDA SDK |
| `vulkan` | Vulkan acceleration | + Vulkan SDK |
| `metal` | Apple Metal acceleration | + Metal framework |
| `rocm` | AMD ROCm acceleration | + ROCm |
| `macro` | `#[tool]` and `#[agent]` attribute macros (see ambi-macros) | `ambi-macros` |
| `mtmd` | Multimodal support for Llama (VLM) | + `base64` |
You cannot enable more than one GPU backend at once; there is a compile-time `compile_error!` guard for this.
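The guard works along these lines (a sketch of the general pattern, not the crate's exact source):

```rust
// Sketch of the mutual-exclusion pattern described above.
#[cfg(any(
    all(feature = "cuda", feature = "vulkan"),
    all(feature = "cuda", feature = "metal"),
    all(feature = "cuda", feature = "rocm"),
    all(feature = "vulkan", feature = "metal"),
    all(feature = "vulkan", feature = "rocm"),
    all(feature = "metal", feature = "rocm"),
))]
compile_error!("Enable at most one GPU backend: cuda, vulkan, metal, or rocm.");
```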
## Runtime requirement

The agent API is async, so your project also needs a Tokio runtime:
```toml
tokio = { version = "1", features = ["rt-multi-thread", "sync", "time", "macros"] }
```

See native platform for details.
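Putting it together, a minimal entry point might look like the sketch below. The `macros` feature above is what provides `#[tokio::main]`; the engine construction simply mirrors the OpenAI example earlier and is an assumption, not a prescribed pattern.

```rust
// Minimal sketch of an async entry point using the OpenAI-style config above.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = LLMEngineConfig::OpenAI(OpenAIEngineConfig {
        api_key: std::env::var("OPENAI_API_KEY")?,
        base_url: "https://api.openai.com/v1".to_string(),
        model_name: "gpt-4o".to_string(),
        temp: 0.7,
        top_p: 0.9,
    });
    let _agent = Agent::make(engine).await?;
    Ok(())
}
```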