PRICING

Build without limits

Flexible pricing for teams at every scale

Pay-as-you-go

For developers and small teams

Prompt optimization

  • 10 free successful optimizations per month
  • $20 for each additional successful optimization
  • 4 target models per run
  • SOTA performance and prompt portability

Intelligent routing

  • 10K free routing recommendations per month
  • $10 for every 10K additional recommendations
  • Pre-trained router for chat auto mode
  • 3 free custom routers

Custom

For enterprise teams ready to scale

  • Agent optimization
  • Bulk pricing
  • VPC deployments
  • Bring your own models
  • Custom evaluation metrics
  • Priority API job queue
  • More target models per run
  • More custom routers
  • Custom ZDR policies
  • 24/7 support

We offer discounts for startups and researchers

Frequently asked questions

What is prompt optimization?

Prompt optimization is a design-time, data-driven algorithm that takes your original static prompt template and your evaluation dataset, then uses LLMs in an agentic loop to iterate over many variations of your prompt for each target model you’ve specified. The iterations are guided by reinforcement learning against the evaluation dataset, and the algorithm leverages self-reflective improvements by the optimizer agent. At the end of the optimization loop, a unique optimized prompt is returned for each target model, together with a report of the accuracy improvements.
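
As a rough illustration of the design-time loop described above (not our actual algorithm), here is a minimal sketch; the `evaluate` and `propose_variation` helpers are hypothetical stand-ins for your evaluation metric and the optimizer agent.

```python
# Illustrative sketch only; not Not Diamond's actual optimization algorithm.
import random

def evaluate(prompt, model, eval_dataset):
    """Stand-in for your evaluation metric (e.g. exact match or LLM-as-a-judge)."""
    return random.random()

def propose_variation(prompt, model, eval_dataset):
    """Stand-in for the optimizer agent's self-reflective rewrite of the prompt."""
    return prompt + " Think step by step."

def optimize_prompt(original_prompt, eval_dataset, target_models, n_iterations=20):
    results = {}
    for model in target_models:                  # each target model gets its own prompt
        baseline = evaluate(original_prompt, model, eval_dataset)
        best_prompt, best_score = original_prompt, baseline
        for _ in range(n_iterations):
            candidate = propose_variation(best_prompt, model, eval_dataset)
            score = evaluate(candidate, model, eval_dataset)
            if score > best_score:               # keep only improvements
                best_prompt, best_score = candidate, score
        results[model] = {
            "optimized_prompt": best_prompt,
            "accuracy_before": baseline,
            "accuracy_after": best_score,
        }
    return results
```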

What counts as a successful prompt optimization?

A successful prompt optimization is one in which we return a prompt that improves accuracy for a particular target model, relative to the performance of your original prompt on that model. If we’re not able to deliver any accuracy improvements on your target model, we don’t bill you anything other than the LLM inference costs incurred during optimization.
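
Read together with the Pay-as-you-go pricing above, the billing rule can be sketched as follows; the function is illustrative, not our actual billing logic.

```python
# Sketch of the billing rule implied above, using the Pay-as-you-go numbers
# (10 free successful optimizations per month, $20 each afterwards).
FREE_SUCCESSFUL_OPTIMIZATIONS_PER_MONTH = 10
PRICE_PER_ADDITIONAL_SUCCESS_USD = 20

def optimization_charge(accuracy_before, accuracy_after, successes_this_month):
    """Charge for one run, excluding pass-through inference costs."""
    successful = accuracy_after > accuracy_before     # must beat your original prompt
    if not successful:
        return 0                                      # unsuccessful runs are not billed
    if successes_this_month < FREE_SUCCESSFUL_OPTIMIZATIONS_PER_MONTH:
        return 0                                      # inside the monthly free tier
    return PRICE_PER_ADDITIONAL_SUCCESS_USD
```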

Are there additional costs for prompt optimization?

Prompt optimization incurs inference costs while running the optimization algorithm, and these are passed through directly to you at cost. Depending on the amount of data you provide, the cost of the target models you’re optimizing against, and whether you are using an LLM-as-a-judge for evaluation, each target model can incur anywhere from a few cents to a few dollars (or more) in inference costs. We provide a best-effort cost estimate based on typical token usage and run behavior, but actual inference costs may differ due to the non-deterministic nature of LLMs.
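
As a back-of-the-envelope illustration of how those factors combine, the estimate below multiplies call volume by token usage and model price; all numbers and the helper function are made-up assumptions, not our estimator.

```python
# Rough cost illustration only; token counts, iteration counts, and prices
# below are assumptions, not Not Diamond's actual estimator.
def estimate_inference_cost(n_samples, n_iterations, tokens_per_call,
                            price_per_million_tokens, judge_overhead=1.0):
    calls = n_samples * n_iterations
    total_tokens = calls * tokens_per_call * judge_overhead
    return total_tokens / 1_000_000 * price_per_million_tokens

# e.g. 50 samples x 20 iterations x 1,500 tokens at $2.50 per 1M tokens ~= $3.75
print(estimate_inference_cost(50, 20, 1_500, 2.50))
```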

How much data do I need for prompt optimization?

Prompt optimization is extremely data-efficient and can work with as few as three data samples. However, to maximize the accuracy of your optimized prompts, we recommend providing more data if you have access to it.

How long does prompt optimization take?

Prompt optimization requires between 5 and 45 minutes of background processing, depending on the amount of data you provide and the latency of the target models. Optimizations are parallelized, so additional target models do not increase the latency of a run. For comparison, most AI product engineers and data scientists estimate needing between 8 and 40 hours to optimize a prompt manually for a single model.

What is agent optimization?

Agent optimization extends our prompt optimization algorithm to optimize your autonomous agents and multi-step workflows end-to-end. Agent optimization is currently in beta. If you’d like to test it, please contact us.

What is intelligent routing?

Intelligent routing is an ultra-low-latency runtime optimization that dynamically predicts which LLM to use for each incoming input, maximizing accuracy while reducing cost and latency. Our algorithm takes in a set of inputs, the corresponding responses from the models you want to route between, and scores for those responses, and learns a mapping from inputs to model rankings. Default rankings maximize accuracy, but they can be flexibly adjusted to trade off cost and latency. For Custom plans, this tradeoff can be granularly controlled via Pareto optimization. We offer both a pre-trained, out-of-the-box router and the ability to train your own custom router on your own data.
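
For illustration, the data a custom router learns from looks roughly like the structure below; the field names and model identifiers are placeholders, not our actual schema.

```python
# Illustrative shape of custom-router training data: each record pairs an input
# with candidate model responses and your evaluation scores for those responses.
training_data = [
    {
        "input": "Summarize this support ticket in one sentence: ...",
        "responses": {
            "openai/gpt-4o": "Customer cannot reset their password ...",
            "anthropic/claude-3-5-sonnet": "The user reports a failed password reset ...",
            "meta/llama-3.1-8b-instruct": "Password issue.",
        },
        "scores": {                      # your own eval metric, e.g. 0-1 accuracy
            "openai/gpt-4o": 0.92,
            "anthropic/claude-3-5-sonnet": 0.95,
            "meta/llama-3.1-8b-instruct": 0.40,
        },
    },
    # ... more samples; the router learns a mapping from inputs to model rankings
]
```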

What counts as a routing recommendation?

A routing recommendation occurs when you submit a dynamic input and the list of models you want to route between, after which Not Diamond returns a recommendation for which model to use on that particular input. In other words, every input to your AI application triggers a routing recommendation.
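
A recommendation call looks roughly like the sketch below; the endpoint URL and payload fields are placeholders rather than our documented API, so consult the API reference for the real request format.

```python
# Minimal sketch of a routing recommendation request. The URL and field names
# are placeholders, not Not Diamond's documented API.
import os
import requests

payload = {
    "messages": [{"role": "user", "content": "Explain CRDTs to a backend engineer."}],
    "models": ["openai/gpt-4o-mini", "anthropic/claude-3-5-haiku", "openai/gpt-4o"],
}

resp = requests.post(
    "https://api.example-router.dev/v1/recommend",   # placeholder endpoint
    headers={"Authorization": f"Bearer {os.environ['ROUTER_API_KEY']}"},
    json=payload,
    timeout=5,
)
recommended_model = resp.json()["model"]             # e.g. "openai/gpt-4o-mini"
```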

What is the latency of a routing recommendation?

Each routing recommendation takes 10–100ms, depending on the amount of data used to train your router. Additional network latency may be incurred depending on your infrastructure setup. Please reach out to us if you have specific latency requirements.

How much can Not Diamond save me on inference costs?

Prompt optimization allows teams to get far better performance out of much smaller models, while intelligent routing offers additional flexible cost and latency controls that you can adjust to fit your business needs. Depending on the use case, we see teams achieving anywhere from 10x to 100x savings on inference using Not Diamond.

How does Not Diamond fit into my existing stack?

Not Diamond is stack-agnostic and designed to integrate with your existing toolchain. As the intelligent optimization layer in your stack, we are not a gateway, a proxy, an inference provider, or an evaluation pipeline. Rather, we leverage your existing evaluation metrics and models to help you optimize how to prompt each model and when to call it. Inference requests are then handled client-side, whether through an internal gateway, an external one, or by calling the providers directly.
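
To make that division of responsibilities concrete, here is a rough sketch of the client-side half of the flow, calling the OpenAI and Anthropic SDKs directly; the provider-prefixed model identifiers and the dispatch helper are illustrative.

```python
# Sketch of the client-side half of the flow: the routing layer only returns a
# recommendation; your application then calls the chosen provider itself.
from openai import OpenAI
from anthropic import Anthropic

def run_inference(recommended_model: str, messages: list[dict]) -> str:
    provider, model = recommended_model.split("/", 1)
    if provider == "openai":
        completion = OpenAI().chat.completions.create(model=model, messages=messages)
        return completion.choices[0].message.content
    if provider == "anthropic":
        response = Anthropic().messages.create(
            model=model, max_tokens=1024, messages=messages
        )
        return response.content[0].text
    raise ValueError(f"No client configured for provider {provider!r}")

# e.g. run_inference("openai/gpt-4o-mini", [{"role": "user", "content": "Hi"}])
```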

Why do I need to add a credit card?

Each prompt optimization run generates LLM inference costs which are billed to you, so we require a credit card on file to charge those inference costs. Upon signing up you may see a temporary pre-authorization hold from your bank or card issuer to verify the payment method, but any actual charges will come only from billable usage or inference costs.

How does Not Diamond handle security and compliance?

Not Diamond is SOC 2 and ISO 27001 compliant. We provide custom ZDR policies, VPC deployments, and 24/7 on-call support to the most sophisticated AI teams in the world.

100x your AI dev cycles

Let the machine build the machine