Getting Started

This guide will help you to setup your first inference API using Infersec's console and your own hardware. Note that at this point that there are a number of assumptions being made:

You have access to a Linux/macOS machine with suitably-well-performing system specifications in terms of running a model. For Linux machines, this assumes you have a GPU with at least 8GB of VRAM, with its drivers installed and up to date, including peripheral software to enable a model to run (eg. CUDA for NVIDIA cards). For macOS devices, this assumes you're on an Apple Silicon device with ideally 16GB+ RAM.
You already have an Infersec account with credits added.

With these points taken care of, following this guide will help you to setup a new model, connect it to a source, connect that source to your machine using an API key, and then expose that source using an endpoint:

  User --> Infersec Endpoint --> Infersec Source <-- Infersec Conduit --> LLM Engine --> Your GPU

Conduit is our software that runs on your device, connecting your machine to the Infersec cloud (tied to a source).

Step 1 - Register a model

The first step is to create a model. The model will determine a number of things regarding your inferencing API:

The quality of the responses to your prompts.
The speed at which the AI responds.
How well the AI is able to determine which tools to call (MCP etc.), including how well it's able to call them at all.

For this exercise it is not important which model you run, but rather that you're able to configure a model to run at all. This is the largest hurdle in the setup process, as it requires the most configuration on your part.

Prepare your device

Before continuing, you must decide on the engine you'll use to run the model. llama.cpp is widely regarded as a good option, as it's quite stable, has a frequent update/release schedule, and supports Linux and macOS very well. vLLM is also an option, within Docker, but it is slightly more advanced. If you go llama.cpp, you'll look for GGUF model formats, and on vLLM likely Safetensors. Regardless, you should look into installing one of these before continuing:

llama.cpp
vLLM: No installation needed, but you will need Docker:
- OrbStack on macOS
- Docker on Linux

Finally, if you're using llama.cpp, you'll need NodeJS installed.

Configure the model

Navigate to Inferencing → Models and click Create model. Search for the model you want to serve — weights are sourced from HuggingFace. Give it a name and create it. The model's format determines which engine it can run on — GGUF models use llama.cpp, everything else uses vLLM. For a list of benchmarked models with performance data, see Recommended Models.

Step 2 - Create an API Key

API keys are used to provide external access to the following resources:

Endpoints (needed for your AI-consuming software)
Sources (needed for Conduit to connect)

Navigate to Administration → API Keys and create a new key. Make sure to immediately copy the value, as it won't be presented again.

Step 3 - Create an inference source

Navigate to Inferencing → Sources and click Create source. An inference source represents a single machine that will run inference. Select the engine and model (that you created in a previous step), configure context length and parallelism to suit your hardware, then create it. The inference source detail page shows the Conduit connect command.

Some details here are relevant to the configuration of the model, usually from HuggingFace. Context length is dependent on the model configuration and training. It is usually safe to set it at 16K or 32K, but if you check the configuration first you'll likely find higher values are possible. You usually want a larger context so as to increase the LLMs working memory.

The quantization, which is available on certain model formats (like GGUF), determines the overall "quality" of the model. Generally speaking you'll want Q4 or higher. This is a complex topic that deserves some reading, but start with Q4, Q4_K_M etc. and if that works well enough, try increasing it.

Step 4 - Connect your machine

Run the Conduit agent on the machine that will serve inference:

npx @infersec/conduit inference start \
  --engine <engine> \
  --key <your-api-key> \
  --source <source-id>

Conduit will download the model files, boot the engine, and report its state to Infersec. Once the inference source shows Online in the console, it's ready for traffic.

While this example shows that engine can be configured here, it is almost always going to be llama.cpp. vLLM should be run in Docker in most cases, unless you have it installed locally. The Inference Source page has instructions on the exact commands to run.

Step 5 - Create an inference endpoint

Navigate to Inferencing → Endpoints and click Create endpoint. An inference endpoint exposes a public URL that routes requests to one or more inference sources. Choose a routing method (first-available or round-robin), attach your inference sources, and enable it.

Step 6 - Use the inference endpoint

Inference endpoints expose OpenAI and Anthropic-compatible APIs. The base URL follows this pattern:

https://api.infersec.ai/api/inferencing/<endpoint-id>/oai

Authenticate using an Authorization: Bearer <key> or x-api-key header with your Infersec API key. The model name is always default.

To get quick feedback on whether your system is live or not:

Navigate to the sources list page, verify it shows your source as "Online".
Navigate to the endpoints list page, find your endpoint.
Click on the Chat icon at the end of the row.
Attempt to prompt your LLM - say hi.

Opencode

Add a custom provider to opencode.json:

{
    "$schema": "https://opencode.ai/config.json",
    "provider": {
        "infersec": {
            "models": { "default": { "name": "Default" } },
            "name": "Infersec",
            "npm": "@ai-sdk/openai-compatible",
            "options": {
                "apiKey": "<your-api-key>",
                "baseURL": "https://api.infersec.ai/api/inferencing/<endpoint-id>/oai/v1"
            }
        }
    }
}

AnythingLLM / LM Studio

Point a generic OpenAI-compatible connection at the inference endpoint base URL, using default as the model name and your API key for authorization.

Anthropic SDK

Use the Anthropic base URL at /api/inferencing/<endpoint-id>/anthropic with the x-api-key header and anthropic-version: 2023-06-01.

API reference

For programmatic management of models, sources, and endpoints, see the Resource Management API docs and the interactive API reference.