# Usage guide
This guide covers the core concepts and patterns for using the `vllora_llm` crate effectively.
## Basic usage: completions client (gateway-native)
The main entrypoint is `VlloraLLMClient`, which gives you a `CompletionsClient` for chat completions using the gateway-native request/response types.
```rust
use std::sync::Arc;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // In production you would pass a real ModelInstance implementation
    // that knows how to call your configured providers / router.
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));

    // Build the high-level client
    let client = VlloraLLMClient::new_with_instance(instance);

    // Build a simple chat completion request
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(), // or any gateway-configured model id
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Say hello in one short sentence.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // Send the request and get a single response message
    let response = client.completions().create(request).await?;
    let message = response.message();

    if let Some(content) = &message.content {
        if let Some(text) = content.as_string() {
            println!("Model reply: {text}");
        }
    }

    Ok(())
}
```
Key pieces:
- `VlloraLLMClient`: wraps a `ModelInstance` and exposes `.completions()`.
- `CompletionsClient::create`: sends a one-shot completion request and returns a `ChatCompletionMessageWithFinishReason`.
- Gateway types (`ChatCompletionRequest`, `ChatCompletionMessage`) abstract over provider-specific formats.
## Streaming completions
`CompletionsClient::create_stream` returns a `ResultStream` that yields streaming chunks:
```rust
use std::sync::Arc;

use futures::StreamExt; // brings `.next()` into scope for the returned stream

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
    let client = VlloraLLMClient::new_with_instance(instance);

    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Stream the alphabet, one chunk at a time.".to_string(),
        )],
        ..Default::default()
    };

    // Consume the stream chunk by chunk and print each content delta
    let mut stream = client.completions().create_stream(request).await?;
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}
```
The stream API mirrors OpenAI-style streaming but uses gateway-native `ChatCompletionChunk` types.
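Building on the example above, you can also accumulate the streamed deltas into a single string. The sketch below reuses the same `chunk.choices` / `choice.delta.content` fields shown earlier; it assumes `futures::StreamExt` provides `.next()` on the returned stream and that `completions()` can be called through a shared reference, so treat it as illustrative rather than a verbatim API reference.

```rust
use futures::StreamExt;

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::ChatCompletionRequest;
use vllora_llm::error::LLMResult;

// Sketch: collect all streamed content deltas into one String.
async fn collect_stream(
    client: &VlloraLLMClient,
    request: ChatCompletionRequest,
) -> LLMResult<String> {
    let mut stream = client.completions().create_stream(request).await?;
    let mut full_text = String::new();

    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                full_text.push_str(&delta);
            }
        }
    }

    Ok(full_text)
}
```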
## Supported parameters
The table below lists which `ChatCompletionRequest` (and provider-specific) parameters are honored by each provider when using `VlloraLLMClient`:
| Parameter | OpenAI / Proxy | Anthropic | Gemini | Bedrock | Notes |
|---|---|---|---|---|---|
| `model` | yes | yes | yes | yes | Taken from `ChatCompletionRequest.model` or engine config. |
| `max_tokens` | yes | yes | yes | yes | Mapped to provider-specific `max_tokens` / `max_output_tokens`. |
| `temperature` | yes | yes | yes | yes | Sampling temperature. |
| `top_p` | yes | yes | yes | yes | Nucleus sampling. |
| `n` | no | no | yes | no | For Gemini, mapped to `candidate_count`; other providers always use `n = 1`. |
| `stop` / `stop_sequences` | yes | yes | yes | yes | Converted to each provider's stop / stop-sequences field. |
| `presence_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `frequency_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `logit_bias` | yes | no | no | no | OpenAI-only token bias map. |
| `user` | yes | no | no | no | OpenAI "end-user id" field. |
| `seed` | yes | no | yes | no | Deterministic sampling where supported. |
| `response_format` (JSON schema, etc.) | yes | no | yes | no | Gemini additionally normalizes JSON schema for its API. |
| `prompt_cache_key` | yes | no | no | no | OpenAI-only prompt caching hint. |
| `provider_specific.top_k` | no | yes | no | no | Anthropic-only: maps to Claude `top_k`. |
| `provider_specific.thinking` | no | yes | no | no | Anthropic "thinking" options (e.g. budget tokens). |
| Bedrock `additional_parameters` map | no | no | no | yes | Free-form JSON, passed through to Bedrock model params. |
Additionally, for Anthropic, the first system message in the conversation is mapped into a `SystemPrompt` (either as a single text string or as multiple `TextContentBlock`s), and any `cache_control` options on those blocks are translated into Anthropic's ephemeral cache-control settings.
All other fields on `ChatCompletionRequest` (such as `stream`, `tools`, `tool_choice`, `functions`, `function_call`) are handled at the gateway layer and/or by per-provider tool integration, but are not mapped 1:1 into provider primitive parameters.
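To make the table above concrete, the sketch below builds a request that sets several of the portable sampling parameters. Field names follow the table, but the exact field types (e.g. whether they are wrapped in `Option`, and whether the stop-sequences field is literally called `stop`) are assumptions about the crate, so adjust to your version.

```rust
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

// Sketch only: field names follow the parameter table; exact types are assumed.
fn sampling_request(model: &str, prompt: &str) -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: model.to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            prompt.to_string(),
        )],
        max_tokens: Some(256),               // mapped to max_tokens / max_output_tokens
        temperature: Some(0.7),              // honored by all listed providers
        top_p: Some(0.9),                    // nucleus sampling
        stop: Some(vec!["END".to_string()]), // converted to provider stop sequences
        ..Default::default()
    }
}
```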
## Notes
- Real usage: In the full LangDB / Vllora gateway, concrete `ModelInstance` implementations are created by the core executor based on your `models.yaml` and routing rules; the examples above use `DummyModelInstance` only to illustrate the public API of the `CompletionsClient`.
- Error handling: All client methods return `LLMResult<T>`, which wraps rich `LLMError` variants (network, mapping, provider errors, etc.); see the sketch after this list.
- More features: The same types in `vllora_llm::types::gateway` are used for tools, MCP, routing, embeddings, and image generation; see the main repository docs at https://vllora.dev/docs for higher-level gateway features.
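Because the exact `LLMError` variants are an implementation detail, the error-handling sketch below only assumes that the error type implements `Display` (as Rust error types conventionally do) and reuses the content-extraction pattern from the basic example:

```rust
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::ChatCompletionRequest;

// Sketch: handle the LLMResult returned by the completions client.
async fn try_completion(client: &VlloraLLMClient, request: ChatCompletionRequest) {
    match client.completions().create(request).await {
        Ok(response) => {
            let message = response.message();
            if let Some(content) = &message.content {
                if let Some(text) = content.as_string() {
                    println!("Model reply: {text}");
                }
            }
        }
        // Assumes the error type implements Display; match on specific
        // LLMError variants here if you need finer-grained handling.
        Err(err) => eprintln!("LLM call failed: {err}"),
    }
}
```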
## Roadmap and issues
- GitHub issues / roadmap: See open LLM crate issues for planned and outstanding work.
- Planned enhancements:
  - Integrate responses API
  - Support built-in MCP tool calls
  - Gemini prompt caching support
  - Full thinking messages support
## License
Licensed under the Apache License, Version 2.0.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be licensed as above, without any additional terms or conditions.