Transforms reference images into dynamic video sequences. Preserves identity, layout, and text from reference images while adding realistic motion, camera movements, and scene progression. Supports multi-shot generation with per-shot prompts and durations, and optional native audio (Chinese/English).

Model name: kling-video-o3-pro-reference-to-video
Pricing: $0.224/second of output video (same rate with audio)
Input: image + text
Output: video

Endpoint

POST /api/videos/generate
Video generation is synchronous: the request blocks until the video is ready (typically 1-5 minutes). Billing is the total video duration multiplied by $0.224/s, so a default 5-second video costs about $1.12. For fire-and-forget or batch generation, use /ai/queue instead.
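The billing rule above is simple arithmetic, and a minimal sketch of it may help when budgeting multi-shot requests. The helper name estimate_cost_usd is hypothetical and not part of the API; the only figure taken from this page is the $0.224/s rate.

```python
# Rate quoted on this page for kling-video-o3-pro-reference-to-video.
PRICE_PER_SECOND_USD = 0.224


def estimate_cost_usd(shot_durations):
    """Estimate the charge for a video made of the given shot durations (seconds).

    Hypothetical helper: it only multiplies total seconds by the quoted rate.
    """
    total_seconds = sum(shot_durations)
    return round(total_seconds * PRICE_PER_SECOND_USD, 3)


print(estimate_cost_usd([5]))        # default 5-second video
print(estimate_cost_usd([5, 5, 5]))  # three 5-second shots, 15 seconds total
```

The estimate covers output seconds only, since the page states that audio does not change the rate.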

Request Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | yes | | "kling-video-o3-pro-reference-to-video" |
| prompt | string | one of | | Single prompt for the video. Use this or multi_prompt, not both. Max 512 characters. |
| multi_prompt | array | one of | | Multi-shot prompts. See multi_prompt below. |
| duration | integer | no | 5 | Duration in seconds when using prompt. |
| input_image | array of URIs | no | | Reference images for style/appearance (max 4 combined with elements). Reference in prompts as @Image1, @Image2, etc. |
| start_image_url | string (URI) | no | | First frame of the video. The model extends from this image. |
| tail_image_url | string (URI) | no | | Last frame of the video. Requires start_image_url. The model fills in between the frames. |
| elements | array of objects | no | | Structured element references for characters/objects. See elements below. |
| negative_prompt | string | no | "blur, distort, and low quality" | Text describing what to avoid in the generated video. |
| aspect_ratio | string | no | "16:9" | "9:16", "1:1", or "16:9". |
| generate_audio | boolean | no | false | Generate native audio. Supports Chinese and English voice output. |
| response_format | string | no | "url" | "url" returns a hosted URL. "b64_json" returns base64-encoded video bytes inline. |
| target_namespace | string | no | current user | Namespace to save results and bill to. Can be an organization name. |

prompt vs multi_prompt

Use either prompt or multi_prompt, not both. Sending both returns:
"Cannot provide both 'prompt' and 'multi_prompt'."
Sending neither (or an empty multi_prompt: []) returns:
"Either 'prompt' or 'multi_prompt' must be provided."
When using prompt, the duration defaults to 5 seconds. Override with duration:
{"model": "kling-video-o3-pro-reference-to-video", "prompt": "A flower blooming in timelapse", "duration": 10}
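Because a synchronous request can take minutes, it may be worth catching the prompt/multi_prompt conflict before sending anything. The following is a minimal client-side sketch, not the server's validation logic: check_prompt_fields is a hypothetical helper, it assumes an empty string or empty list counts as "not provided", and the 512-character message is local wording rather than an API error.

```python
MAX_PROMPT_CHARS = 512  # limit stated in the parameter table


def check_prompt_fields(payload):
    """Return a list of problems found in the prompt-related fields.

    Hypothetical pre-flight check. Treating empty values as absent is an
    assumption; the page only confirms this for an empty multi_prompt list.
    """
    problems = []
    has_prompt = bool(payload.get("prompt"))
    has_multi = bool(payload.get("multi_prompt"))

    if has_prompt and has_multi:
        problems.append("Cannot provide both 'prompt' and 'multi_prompt'.")
    elif not has_prompt and not has_multi:
        problems.append("Either 'prompt' or 'multi_prompt' must be provided.")

    # Collect every prompt string that will be sent and check its length.
    prompts = [payload["prompt"]] if has_prompt else []
    if has_multi:
        prompts += [shot.get("prompt", "") for shot in payload["multi_prompt"]]
    for text in prompts:
        if len(text) > MAX_PROMPT_CHARS:
            problems.append(f"Prompt is {len(text)} characters; the limit is {MAX_PROMPT_CHARS}.")
    return problems


print(check_prompt_fields({"prompt": "A flower blooming in timelapse", "duration": 10}))
print(check_prompt_fields({"prompt": "A flower", "multi_prompt": [{"prompt": "A bee"}]}))
```

An empty list means the payload passes these checks; the server remains the source of truth for anything the sketch does not cover.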

multi_prompt

Array of shot objects. Each shot generates a segment of the video.
| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| prompt | string | yes | | Prompt for this shot. Max 512 characters. |
| duration | integer | no | 5 | Duration of this shot in seconds (1-15). |

Duration Constraints

| Constraint | Value |
| --- | --- |
| Minimum total duration | 3 seconds |
| Maximum total duration | 15 seconds |
| Maximum per shot | 15 seconds |
| Default per shot | 5 seconds |
Individual shots can be as short as 1 second, as long as the total across all shots is between 3 and 15 seconds.
| Configuration | Total | Result |
| --- | --- | --- |
| Single shot, duration: 1 | 1s | Fails |
| Single shot, duration: 2 | 2s | Fails |
| Single shot, duration: 3 | 3s | Works |
| Two shots: duration: 2 + duration: 1 | 3s | Works |
| Two shots: duration: 1 + duration: 1 | 2s | Fails |
| Single shot, duration: 15 | 15s | Works |
| Three shots: duration: 5 + duration: 5 + duration: 5 | 15s | Works |
| Three shots: duration: 5 + duration: 5 + duration: 6 | 16s | Fails |
When total duration is too short:
"duration value '2' is invalid. Try using duration='5' instead, as duration support may vary by model and mode."
When total duration exceeds 15 seconds:
"Total shot duration (16s) exceeds maximum allowed (15s)."
When a single shot exceeds 15 seconds:
"Input should be '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14' or '15'"
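The duration rules above can be checked locally before a request is sent. This is a rough sketch under the constraints listed on this page (1-15 seconds per shot, 3-15 seconds in total, 5 seconds when duration is omitted); check_durations is a hypothetical helper and its messages are local wording, not the API's error strings.

```python
MIN_TOTAL_SECONDS, MAX_TOTAL_SECONDS = 3, 15
MIN_SHOT_SECONDS, MAX_SHOT_SECONDS = 1, 15
DEFAULT_SHOT_SECONDS = 5


def check_durations(multi_prompt):
    """Return (ok, message) for a list of shot objects.

    Hypothetical pre-flight check mirroring the documented limits.
    """
    # A shot without an explicit duration falls back to the 5-second default.
    durations = [shot.get("duration", DEFAULT_SHOT_SECONDS) for shot in multi_prompt]

    for index, seconds in enumerate(durations, start=1):
        if not MIN_SHOT_SECONDS <= seconds <= MAX_SHOT_SECONDS:
            return False, f"shot {index} is {seconds}s; each shot must be 1-15s"

    total = sum(durations)
    if total < MIN_TOTAL_SECONDS:
        return False, f"total is {total}s; minimum is {MIN_TOTAL_SECONDS}s"
    if total > MAX_TOTAL_SECONDS:
        return False, f"total is {total}s; maximum is {MAX_TOTAL_SECONDS}s"
    return True, f"total is {total}s"


print(check_durations([{"prompt": "a", "duration": 2}, {"prompt": "b", "duration": 1}]))
print(check_durations([{"prompt": "a", "duration": 5}, {"prompt": "b", "duration": 5}, {"prompt": "c", "duration": 6}]))
```

The two calls correspond to the "2 + 1" and "5 + 5 + 6" rows of the table above.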

elements

Array of element objects for character/object reference. Use @Element1, @Element2, etc. in prompts.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| frontal_image_url | string (URI) | yes | Front view of the reference object or character. |
| reference_image_urls | array of URIs | no | Additional angles. Max 3 images per element. |
Maximum 4 total images across all elements and input_image references.
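The shared image budget is easy to exceed when combining input_image with elements, so a small counter can help. This sketch rests on two assumptions that the page does not spell out: that each frontal_image_url counts toward the limit of 4, and that the per-element limit of 3 applies to reference_image_urls only. check_image_budget is a hypothetical helper.

```python
MAX_TOTAL_IMAGES = 4
MAX_REFERENCE_IMAGES_PER_ELEMENT = 3


def check_image_budget(payload):
    """Return a list of problems with the reference-image counts in a payload.

    Assumption: frontal images and additional angles each count as one image
    toward the shared limit of 4. The API may count them differently.
    """
    problems = []
    total = len(payload.get("input_image", []))

    for index, element in enumerate(payload.get("elements", []), start=1):
        extra_angles = element.get("reference_image_urls", [])
        if len(extra_angles) > MAX_REFERENCE_IMAGES_PER_ELEMENT:
            problems.append(f"element {index} has {len(extra_angles)} reference images; max is 3")
        total += 1 + len(extra_angles)  # frontal image plus additional angles

    if total > MAX_TOTAL_IMAGES:
        problems.append(f"{total} images in total; max is {MAX_TOTAL_IMAGES}")
    return problems


payload = {
    "input_image": ["https://example.com/style.jpg"],
    "elements": [
        {
            "frontal_image_url": "https://example.com/character-front.jpg",
            "reference_image_urls": ["https://example.com/character-side.jpg"],
        }
    ],
}
print(check_image_budget(payload))
```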

Examples

Minimal: text only

input_image is optional. Without it the model generates purely from the prompt.
import requests

response = requests.post(
    "https://hub.oxen.ai/api/videos/generate",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "kling-video-o3-pro-reference-to-video",
        "prompt": "A puppy runs through a park",
    },
)

data = response.json()
print("Video URL:", data["videos"][0]["url"])

Single prompt with reference image

import requests

response = requests.post(
    "https://hub.oxen.ai/api/videos/generate",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "kling-video-o3-pro-reference-to-video",
        "prompt": "A dog runs across a sunny field",
        "input_image": ["https://example.com/dog.jpg"],
    },
)

data = response.json()
print("Video URL:", data["videos"][0]["url"])

Multi-shot with reference image

import requests

response = requests.post(
    "https://hub.oxen.ai/api/videos/generate",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "kling-video-o3-pro-reference-to-video",
        "multi_prompt": [
            {"prompt": "A woman walks toward the camera smiling, cinematic lighting", "duration": 5},
            {"prompt": "She turns and looks out a window, soft focus background", "duration": 5},
        ],
        "input_image": ["https://example.com/reference-face.jpg"],
    },
)

data = response.json()
print("Video URL:", data["videos"][0]["url"])

With start/end frames and elements

import requests

response = requests.post(
    "https://hub.oxen.ai/api/videos/generate",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "kling-video-o3-pro-reference-to-video",
        "multi_prompt": [
            {"prompt": "@Element1 picks up a coffee cup from the table", "duration": 5},
        ],
        "start_image_url": "https://example.com/first-frame.jpg",
        "tail_image_url": "https://example.com/last-frame.jpg",
        "elements": [
            {
                "frontal_image_url": "https://example.com/character-front.jpg",
                "reference_image_urls": ["https://example.com/character-side.jpg"],
            }
        ],
        "aspect_ratio": "16:9",
        "generate_audio": True,
    },
)

data = response.json()
print("Video URL:", data["videos"][0]["url"])

Response (response_format: "url")

{
  "created": 1775090723,
  "model": "kling-video-o3-pro-reference-to-video",
  "videos": [
    {
      "url": "https://hub.oxen.ai/api/repos/.../files/.../video.mp4?..."
    }
  ]
}
The URL is a temporary link that expires after a period of time.

Response (response_format: "b64_json")

{
  "created": 1775090723,
  "model": "kling-video-o3-pro-reference-to-video",
  "videos": [
    {
      "b64_json": "<base64-encoded mp4 bytes>"
    }
  ]
}
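With response_format set to "b64_json", the video arrives as a base64 string that has to be decoded before it can be played. A minimal sketch, assuming the response body has the shape shown above; decode_video_bytes and save_video are hypothetical helper names, and the sample payload is a stand-in rather than real video data.

```python
import base64
from pathlib import Path


def decode_video_bytes(response_body, index=0):
    """Decode the base64 payload of one video in a b64_json response."""
    encoded = response_body["videos"][index]["b64_json"]
    return base64.b64decode(encoded)


def save_video(response_body, path, index=0):
    """Write the decoded bytes to disk and return the number of bytes written."""
    return Path(path).write_bytes(decode_video_bytes(response_body, index))


# Stand-in response: a few placeholder bytes instead of a real MP4.
sample = {"videos": [{"b64_json": base64.b64encode(b"not a real mp4").decode("ascii")}]}
print(len(decode_video_bytes(sample)))
```

In real use, pass response.json() from the generate call to save_video along with an output path such as "video.mp4".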

Using with /ai/queue

Recommended for video generation. Returns immediately, processes in the background.

Enqueue

import requests

response = requests.post(
    "https://hub.oxen.ai/api/ai/queue",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "kling-video-o3-pro-reference-to-video",
        "multi_prompt": [{"prompt": "A person speaking into a microphone", "duration": 5}],
        "generate_audio": True,
        "num_generations": 2,
    },
)

generations = response.json()["generations"]
for g in generations:
    print(f"ID: {g['generation_id']}, Status: {g['status']}")

Poll

import requests

response = requests.get(
    "https://hub.oxen.ai/api/media/generations/status/YOUR_USERNAME/kling-video-o3-pro-reference-to-video",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

status = response.json()
print(f"Remaining: {status['count']}")
When finished, the generation disappears from the list. A count of 0 means all generations are complete.
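Since completion is signalled by the count reaching 0, polling is usually wrapped in a loop with a delay and an upper bound. The sketch below assumes the status response contains the count field shown above; wait_until_done, fetch_remaining, the 15-second interval, and the 15-minute cap are all illustrative choices, not API requirements.

```python
import time

STATUS_URL = (
    "https://hub.oxen.ai/api/media/generations/status/"
    "YOUR_USERNAME/kling-video-o3-pro-reference-to-video"
)


def fetch_remaining(url=STATUS_URL, api_key="YOUR_API_KEY"):
    """Return the number of generations still in progress."""
    import requests  # imported lazily so the loop below can be exercised offline

    response = requests.get(url, headers={"Authorization": f"Bearer {api_key}"}, timeout=30)
    response.raise_for_status()
    return response.json()["count"]


def wait_until_done(fetch=fetch_remaining, interval_seconds=15, max_wait_seconds=900, sleep=time.sleep):
    """Poll until the remaining count is 0. Return False if the wait cap is hit."""
    waited = 0
    while True:
        if fetch() == 0:
            return True
        if waited >= max_wait_seconds:
            return False
        sleep(interval_seconds)
        waited += interval_seconds


# Offline demonstration with a fake status source that finishes on the third poll.
fake_counts = iter([2, 1, 0])
print(wait_until_done(fetch=lambda: next(fake_counts), sleep=lambda seconds: None))
```

Calling wait_until_done() with no arguments polls the real status endpoint using the placeholder username and API key.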

Cancel

import requests

response = requests.delete(
    "https://hub.oxen.ai/api/media/generations/YOUR_USERNAME/4ef840a4-...",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

print(response.json())

Errors

| Error | Cause | Fix |
| --- | --- | --- |
| Cannot provide both 'prompt' and 'multi_prompt' | Sent both fields | Use one or the other |
| Either 'prompt' or 'multi_prompt' must be provided | Neither sent, or empty array | Provide exactly one of the two |
| Field required | multi_prompt item missing prompt | Every shot needs a prompt string |
| duration value '2' is invalid | Total duration < 3 seconds | Ensure total across shots >= 3 |
| Total shot duration (16s) exceeds maximum allowed (15s) | Total duration > 15 seconds | Keep total at 15 seconds or less |
| Input should be '1', '2', ... or '15' | Single shot > 15 | Keep each shot at 15 seconds or less |
| num_generations must be an integer between 1 and 4 | Invalid count (via /ai/queue) | Use 1-4 |

Other Kling Models

| Model | Input | Use Case | Cost/sec |
| --- | --- | --- | --- |
| kling-video-v2-6-pro-text-to-video | Text only | Simple text-to-video | $0.070 |
| kling-video-v2-6-pro-image-to-video | Image | Animate a single image | $0.070 |
| kling-video-o3-pro-image-to-video | Image + text | Higher quality image animation | $0.224 |
| kling-video-o3-pro-reference-to-video | Images + text | Reference-conditioned, multi-shot | $0.224 |
| kling-video-o3-pro-video-to-video-edit | Video | Edit existing video | $0.336 |
| kling-video-v3-pro-motion-control | Text + image + video | Camera/motion control | $0.168 |
The O3 Pro models produce higher quality output than v2.x but cost roughly 3x more per second.
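The "roughly 3x" figure follows from the per-second rates in the table. As a quick check, the sketch below compares the cost of the same clip length across models; the rates are copied from the table, while the helper names are hypothetical.

```python
# Per-second rates copied from the table above (USD).
COST_PER_SECOND_USD = {
    "kling-video-v2-6-pro-text-to-video": 0.070,
    "kling-video-v2-6-pro-image-to-video": 0.070,
    "kling-video-o3-pro-image-to-video": 0.224,
    "kling-video-o3-pro-reference-to-video": 0.224,
    "kling-video-o3-pro-video-to-video-edit": 0.336,
    "kling-video-v3-pro-motion-control": 0.168,
}


def cost_usd(model, seconds):
    """Cost of a clip of the given length for one model."""
    return round(COST_PER_SECOND_USD[model] * seconds, 3)


def cost_ratio(model_a, model_b):
    """How many times more model_a costs per second than model_b."""
    return COST_PER_SECOND_USD[model_a] / COST_PER_SECOND_USD[model_b]


ratio = cost_ratio("kling-video-o3-pro-reference-to-video", "kling-video-v2-6-pro-text-to-video")
print(round(ratio, 1))  # 3.2
for name in COST_PER_SECOND_USD:
    print(name, cost_usd(name, 10))
```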