Hey, I wanted to introduce something cool we've been working on for a few months.
You can now deploy almost any LLM from Hugging Face at 3-10x the speed you'd get with HF Inference or vLLM. It takes around five minutes to spin up from the time you type in the model name and click deploy, and it runs on either H100 or H200 GPUs.
Demand for many different kinds of LLMs has grown rapidly, and there are real privacy concerns around serverless inference, so we wanted to build a product that lets you create your own private, production-grade deployment of any model. Because there's no need to count tokens for billing, we keep no logs at all, so deployments are fully private (the trade-off is that we can't offer usage metrics either).
Currently it supports around 100 model architectures, and we're adding more, with multimodal support coming in the future.
Financially, if you have a lot of traffic, a dedicated deployment with provisioned GPUs works out much cheaper. With Llama 3.1 8B, for example, running at full saturation (around 50k tokens/second), the net effective price is $0.01 per million tokens.
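As a back-of-envelope sanity check on the quoted figures (50k tokens/second at full saturation, $0.01 per million tokens), here's the implied all-in hourly cost of such a deployment. The numbers come straight from the claim above; the derived hourly cost is just arithmetic, not a quoted price:

```python
# Quoted figures for a saturated Llama 3.1 8B dedicated deployment.
tokens_per_second = 50_000
price_per_million_tokens = 0.01  # dollars

# Tokens produced in one hour of full saturation.
tokens_per_hour = tokens_per_second * 3_600  # 180,000,000

# Implied hourly cost at the quoted effective price.
implied_hourly_cost = (tokens_per_hour / 1_000_000) * price_per_million_tokens
print(f"${implied_hourly_cost:.2f}/hour")  # $1.80/hour
```

The effective per-token price scales inversely with utilization: at half saturation, the same provisioned GPU works out to $0.02 per million tokens.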
It also supports LoRA merging, and the endpoints are fully OpenAI API compatible, so you could, for example, create your own private DeepSeek R1 deployment. Autoscaling is handled out of the box, so if your app or service gets a huge traffic spike, you're covered.
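Since the endpoints are OpenAI API compatible, existing clients should work by pointing them at the deployment's base URL. A minimal stdlib sketch of what such a request looks like (the base URL, API key, and model identifier below are placeholders, not real values):

```python
# Sketch of an OpenAI-compatible /chat/completions request against a
# private deployment. Placeholder URL and key: substitute your own.
import json
import urllib.request

BASE_URL = "https://example.invalid/v1"  # placeholder deployment endpoint

def build_chat_request(model: str, prompt: str, api_key: str = "YOUR_API_KEY"):
    """Build (but don't send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_chat_request("deepseek-ai/DeepSeek-R1", "Hello!")
# To send it: urllib.request.urlopen(req) — the response body follows the
# OpenAI chat completion schema, so existing parsing code should carry over.
```

In practice you'd likely use the official OpenAI SDK instead, passing the deployment's URL as `base_url` when constructing the client.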
Happy to announce this; the speedups were made possible largely by NVIDIA's H200 SXM GPUs and a proprietary speculative decoding algorithm.
We've launched a production-grade API endpoint at $3 per million tokens. We also have some capacity for fine-tuning 405B models while keeping the speed gains, so if you're interested, please get in touch.
For analysis, we'll also offer syncing to BigQuery, Redshift, Snowflake, and the other data warehouses in the future. However, most of the marketers we interviewed analyze and chart their data in Sheets or Excel.
This was from people we had already hired: employees of the company who wanted to build a new product. We're currently looking to add another co-founder-level person, and as such we're keen to meet new people.
Context: when throwing our hats into this bucket, I thought it would be an interesting experiment for the team to take an eco-friendly approach to scaling and growth: the more words users write, the more trees we plant.
Heavy users who write a lot with the product are rewarded with more trees planted, and so the cycle continues.
Development was done by a core team that has worked on other successful SaaS products. Overall I'm optimistic, but I'd love to hear your nitpicks and thoughts.
If you're interested, you can try it out at https://new.avian.io/dedicated-deployments