We are bringing some of the most popular open models on Hugging Face to Cloudflare Workers AI, powered by our production-grade inference solutions such as Text Generation Inference (TGI).
Text Generation Inference (TGI)
https://github.com/huggingface/text-generation-inference/
By deploying to Cloudflare Workers AI, developers can build powerful generative AI applications without managing GPU infrastructure or servers, and at very low operating cost: you pay only for the compute you actually use, not for idle resources.
Generative AI Tools for Developers
This new service builds on the strategic partnership with Cloudflare we announced last year, which simplifies access to and deployment of open generative AI models. It addresses a major problem developers and organizations face: the scarcity of GPUs and the fixed costs of deploying servers.
Strategic Partnership
https://blog.cloudflare.com/partnering-with-hugging-face-deploying-ai-easier-affordable/
Deployment on Cloudflare Workers AI is a simple, cost-effective solution, offering serverless access to Hugging Face models through a pay-as-you-go pricing model.
Pay-as-you-go Billing Model
https://developers.cloudflare.com/workers-ai/platform/pricing
For example, let's say you develop a RAG application that handles approximately 1000 requests per day, with each request containing 1000 input tokens and 100 output tokens, using the Meta Llama 2 7B model. The cost of serving this LLM inference in production is about $1 per day.
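The back-of-the-envelope arithmetic behind that estimate can be sketched as follows. Note that the per-token price used here is an illustrative assumption, not Cloudflare's actual rate; see the pricing page linked above for current numbers.

```python
# Daily cost estimate for the RAG example above.
REQUESTS_PER_DAY = 1000
INPUT_TOKENS = 1000    # tokens per request (prompt + context)
OUTPUT_TOKENS = 100    # tokens per request (completion)

# Hypothetical blended price per million tokens (assumption for
# illustration only -- check the Cloudflare pricing page).
PRICE_PER_MILLION_TOKENS = 0.90

tokens_per_day = REQUESTS_PER_DAY * (INPUT_TOKENS + OUTPUT_TOKENS)
daily_cost = tokens_per_day / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(f"{tokens_per_day:,} tokens/day -> ~${daily_cost:.2f}/day")
# 1,100,000 tokens/day -> ~$0.99/day
```

At roughly 1.1 million tokens per day, any blended rate near $1 per million tokens lands the daily bill at about $1, which is the figure quoted above.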
We are excited to achieve this integration so quickly. Combining the serverless GPU capabilities in Cloudflare's global network with the most popular open-source models on Hugging Face will bring a lot of exciting innovations to our global community.
John Graham-Cumming, CTO of Cloudflare
Usage
Using Hugging Face models on Cloudflare Workers AI is very simple. Here is a step-by-step guide on how to use Hermes 2 Pro on Mistral 7B, the latest model from Nous Research.
You can find all available models in the Cloudflare Collection.
Cloudflare Collection
https://hf.co/collections/Cloudflare/hf-curated-models-available-on-workers-ai-66036e7ad5064318b3e45db6
Note: You need to have a Cloudflare account and API token.
Cloudflare Account
https://developers.cloudflare.com/fundamentals/setup/find-account-and-zone-ids/
API Token
https://dash.cloudflare.com/profile/api-tokens
You can find the "Deploy to Cloudflare" option on all supported model pages, including models like Llama, Gemma, or Mistral.
Open the "Deploy" menu and select "Cloudflare Workers AI", which will open a page with instructions on how to use this model and send requests.
Note: If the model you want to use does not have the "Cloudflare Workers AI" option, it is not currently supported. We are working with Cloudflare to expand model availability; you can contact us to request support for a model.
There are currently two ways to use this integration: through the Workers AI REST API or directly in Workers using the Cloudflare AI SDK. Choose your preferred method and copy the code into your environment. When using the REST API, make sure to define the ACCOUNT_ID and API_TOKEN variables.
Workers AI REST API
https://developers.cloudflare.com/workers-ai/get-started/rest-api/
Cloudflare AI SDK
https://developers.cloudflare.com/workers-ai/get-started/workers-wrangler/#1-create-a-worker-project
ACCOUNT_ID
https://developers.cloudflare.com/fundamentals/setup/find-account-and-zone-ids/
API_TOKEN
https://dash.cloudflare.com/profile/api-tokens
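As a minimal sketch of the REST API path, the snippet below builds a request to the Workers AI `run` endpoint using only the Python standard library. The model ID `@hf/nousresearch/hermes-2-pro-mistral-7b` is assumed here; check the Cloudflare Collection for the exact IDs available on your account, and see the Workers AI REST API docs linked above for the full request and response schema.

```python
import json
import os
import urllib.request

# Read credentials from the environment (see the ACCOUNT_ID and
# API_TOKEN links above for where to find these values).
ACCOUNT_ID = os.environ.get("ACCOUNT_ID", "")
API_TOKEN = os.environ.get("API_TOKEN", "")

# Assumed Workers AI model ID -- verify against the Cloudflare Collection.
MODEL = "@hf/nousresearch/hermes-2-pro-mistral-7b"

url = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}"
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Cloudflare?"},
    ]
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json",
    },
)

# Only send the request when credentials are actually configured.
if ACCOUNT_ID and API_TOKEN:
    with urllib.request.urlopen(request) as response:
        print(json.load(response)["result"]["response"])
```

The same call works with any HTTP client (curl, `fetch`, etc.); the key pieces are the account-scoped URL, the bearer token header, and the chat-style `messages` payload.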
That's it! You can now start sending requests to Hugging Face models hosted on Cloudflare Workers AI. Make sure to use the prompt format and template the model expects.