# Usage guide
This guide covers the core concepts and patterns for using the `vllora_llm` crate effectively.
## Basic usage: completions client (gateway-native)
The main entrypoint is `VlloraLLMClient`, which gives you a `CompletionsClient` for chat completions using the gateway-native request/response types.
```rust
use std::sync::Arc;

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    // In production you would pass a real ModelInstance implementation
    // that knows how to call your configured providers / router.
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));

    // Build the high-level client
    let client = VlloraLLMClient::new_with_instance(instance);

    // Build a simple chat completion request
    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(), // or any gateway-configured model id
        messages: vec![
            ChatCompletionMessage::new_text(
                "system".to_string(),
                "You are a helpful assistant.".to_string(),
            ),
            ChatCompletionMessage::new_text(
                "user".to_string(),
                "Say hello in one short sentence.".to_string(),
            ),
        ],
        ..Default::default()
    };

    // Send the request and get a single response message
    let response = client.completions().create(request).await?;
    let message = response.message();

    if let Some(content) = &message.content {
        if let Some(text) = content.as_string() {
            println!("Model reply: {text}");
        }
    }

    Ok(())
}
```
Key pieces:
- `VlloraLLMClient`: wraps a `ModelInstance` and exposes `.completions()`.
- `CompletionsClient::create`: sends a one-shot completion request and returns a `ChatCompletionMessageWithFinishReason`.
- Gateway types (`ChatCompletionRequest`, `ChatCompletionMessage`) abstract over provider-specific formats.
## Streaming completions
`CompletionsClient::create_stream` returns a `ResultStream` that yields streaming chunks:
```rust
use std::sync::Arc;

use futures::StreamExt; // brings `.next()` into scope for the returned stream

use vllora_llm::client::{VlloraLLMClient, ModelInstance, DummyModelInstance};
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};
use vllora_llm::error::LLMResult;

#[tokio::main]
async fn main() -> LLMResult<()> {
    let instance: Arc<Box<dyn ModelInstance>> = Arc::new(Box::new(DummyModelInstance {}));
    let client = VlloraLLMClient::new_with_instance(instance);

    let request = ChatCompletionRequest {
        model: "gpt-4.1-mini".to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            "Stream the alphabet, one chunk at a time.".to_string(),
        )],
        ..Default::default()
    };

    // Consume the stream chunk by chunk and print each content delta
    let mut stream = client.completions().create_stream(request).await?;
    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                print!("{delta}");
            }
        }
    }

    Ok(())
}
```
The stream API mirrors OpenAI-style streaming but uses gateway-native `ChatCompletionChunk` types.
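Building on the example above, you can also accumulate the streamed deltas into a single string. The sketch below reuses the same `chunk.choices` / `choice.delta.content` fields shown earlier; it assumes `futures::StreamExt` provides `.next()` on the returned stream and that `completions()` can be called through a shared reference, so treat it as illustrative rather than a verbatim API reference.

```rust
use futures::StreamExt;

use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::ChatCompletionRequest;
use vllora_llm::error::LLMResult;

// Sketch: collect all streamed content deltas into one String.
async fn collect_stream(
    client: &VlloraLLMClient,
    request: ChatCompletionRequest,
) -> LLMResult<String> {
    let mut stream = client.completions().create_stream(request).await?;
    let mut full_text = String::new();

    while let Some(chunk) = stream.next().await {
        let chunk = chunk?;
        for choice in chunk.choices {
            if let Some(delta) = choice.delta.content {
                full_text.push_str(&delta);
            }
        }
    }

    Ok(full_text)
}
```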
## Supported parameters
The table below lists which `ChatCompletionRequest` (and provider-specific) parameters are honored by each provider when using `VlloraLLMClient`:
| Parameter | OpenAI / Proxy | Anthropic | Gemini | Bedrock | Notes |
|---|---|---|---|---|---|
| `model` | yes | yes | yes | yes | Taken from `ChatCompletionRequest.model` or engine config. |
| `max_tokens` | yes | yes | yes | yes | Mapped to provider-specific `max_tokens` / `max_output_tokens`. |
| `temperature` | yes | yes | yes | yes | Sampling temperature. |
| `top_p` | yes | yes | yes | yes | Nucleus sampling. |
| `n` | no | no | yes | no | For Gemini, mapped to `candidate_count`; other providers always use `n = 1`. |
| `stop` / `stop_sequences` | yes | yes | yes | yes | Converted to each provider's stop / stop-sequences field. |
| `presence_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `frequency_penalty` | yes | no | yes | no | OpenAI / Gemini only. |
| `logit_bias` | yes | no | no | no | OpenAI-only token bias map. |
| `user` | yes | no | no | no | OpenAI "end-user id" field. |
| `seed` | yes | no | yes | no | Deterministic sampling where supported. |
| `response_format` (JSON schema, etc.) | yes | no | yes | no | Gemini additionally normalizes JSON schema for its API. |
| `prompt_cache_key` | yes | no | no | no | OpenAI-only prompt caching hint. |
| `provider_specific.top_k` | no | yes | no | no | Anthropic-only: maps to Claude `top_k`. |
| `provider_specific.thinking` | no | yes | no | no | Anthropic "thinking" options (e.g. budget tokens). |
| Bedrock `additional_parameters` map | no | no | no | yes | Free-form JSON, passed through to Bedrock model params. |
Additionally, for Anthropic, the first system message in the conversation is mapped into a `SystemPrompt` (either as a single text string or as multiple `TextContentBlock`s), and any `cache_control` options on those blocks are translated into Anthropic's ephemeral cache-control settings.
All other fields on `ChatCompletionRequest` (such as `stream`, `tools`, `tool_choice`, `functions`, `function_call`) are handled at the gateway layer and/or by per-provider tool integration, but are not mapped 1:1 into provider primitive parameters.
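To make the table above concrete, the sketch below builds a request that sets several of the portable sampling parameters. Field names follow the table, but the exact field types (e.g. whether they are wrapped in `Option`, and whether the stop-sequences field is literally called `stop`) are assumptions about the crate, so adjust to your version.

```rust
use vllora_llm::types::gateway::{ChatCompletionRequest, ChatCompletionMessage};

// Sketch only: field names follow the parameter table; exact types are assumed.
fn sampling_request(model: &str, prompt: &str) -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: model.to_string(),
        messages: vec![ChatCompletionMessage::new_text(
            "user".to_string(),
            prompt.to_string(),
        )],
        max_tokens: Some(256),               // mapped to max_tokens / max_output_tokens
        temperature: Some(0.7),              // honored by all listed providers
        top_p: Some(0.9),                    // nucleus sampling
        stop: Some(vec!["END".to_string()]), // converted to provider stop sequences
        ..Default::default()
    }
}
```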
## Notes
- Real usage: In the full LangDB / Vllora gateway, concrete `ModelInstance` implementations are created by the core executor based on your `models.yaml` and routing rules; the examples above use `DummyModelInstance` only to illustrate the public API of the `CompletionsClient`.
- Error handling: All client methods return `LLMResult<T>`, which wraps rich `LLMError` variants (network, mapping, provider errors, etc.); see the sketch after this list.
- More features: The same types in `vllora_llm::types::gateway` are used for tools, MCP, routing, embeddings, and image generation; see the main repository docs at https://vllora.dev/docs for higher-level gateway features.
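Because the exact `LLMError` variants are an implementation detail, the error-handling sketch below only assumes that the error type implements `Display` (as Rust error types conventionally do) and reuses the content-extraction pattern from the basic example:

```rust
use vllora_llm::client::VlloraLLMClient;
use vllora_llm::types::gateway::ChatCompletionRequest;

// Sketch: handle the LLMResult returned by the completions client.
async fn try_completion(client: &VlloraLLMClient, request: ChatCompletionRequest) {
    match client.completions().create(request).await {
        Ok(response) => {
            let message = response.message();
            if let Some(content) = &message.content {
                if let Some(text) = content.as_string() {
                    println!("Model reply: {text}");
                }
            }
        }
        // Assumes the error type implements Display; match on specific
        // LLMError variants here if you need finer-grained handling.
        Err(err) => eprintln!("LLM call failed: {err}"),
    }
}
```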
## Roadmap and issues
- GitHub issues / roadmap: See open LLM crate issues for planned and outstanding work.
- Planned enhancements:
  - Integrate responses API
  - Support built-in MCP tool calls
  - Gemini prompt caching support
  - Full thinking messages support
## License
Licensed under the Apache License, Version 2.0.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this crate by you, as defined in the Apache-2.0 license, shall be licensed as above, without any additional terms or conditions.