Welcome to the Power Users community on Codidact!
Power Users is a Q&A site for questions about the usage of computer software and hardware. We are still a small site and would like to grow, so please consider joining our community. We are looking forward to your questions and answers; they are the building blocks of a repository of knowledge we are building together.
What's the maximum hit rate, if any, when using Claude, Gemini, Llama and Mistral via Google Cloud Compute?
What's the maximum hit rate, if any, when using Claude, Gemini, Llama and Mistral via Google Cloud Compute? (Example of a maximum hit rate: 1M input tokens/minute)
I don't use provisioned throughput.
I call Gemini as follows:
```python
from google import genai

YOUR_PROJECT_ID = "redacted"
YOUR_LOCATION = "us-central1"

client = genai.Client(
    vertexai=True,
    project=YOUR_PROJECT_ID,
    location=YOUR_LOCATION,
)

model = "gemini-2.5-pro-exp-03-25"
response = client.models.generate_content(
    model=model,
    contents=["Tell me a joke about alligators"],
)
print(response.text, end="")
```
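When a per-minute quota is exceeded, calls like the one above typically fail with a rate-limit error (an HTTP 429 / RESOURCE_EXHAUSTED). A common mitigation is exponential backoff; here is a minimal sketch, where `RateLimitError` and the delay schedule are my own placeholders, not the SDK's actual exception or retry API:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's quota-exceeded exception (often HTTP 429)."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

You would wrap the `generate_content` call in a small lambda or function and pass it to `call_with_backoff`; real SDKs often ship their own retry helpers, which are preferable when available.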
1 answer
Quota from https://cloud.google.com/vertex-ai/docs/quotas:
| Request type | Requests per minute |
|---|---|
| Resource management (CRUD) requests | 600 |
| Job or long-running operation (LRO) submission requests | 60 |
| Online prediction requests | 30,000 |
| Online prediction request throughput | 1.5 GB |
| Online explain requests | 600 |
| Vertex AI TensorBoard Time Series read requests | 60,000 |
| ML Metadata (CRUD) requests | 12,000 |
| Generative AI Caching (CRUD) requests | 200 |
| Vertex AI Vizier (CRUD) requests | 6,000 |
| Vertex AI Feature Store online serving requests | 300,000 |
| Vertex ML Metadata requests | 12,000 |
| Number of count tokens or compute tokens requests | 3,000 |
Thanks Googler m1nherz for pointing me to it.
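Since these are per-minute quotas, a client that bursts above a cap will see errors even if its average rate is acceptable. One way to stay under a limit such as the 30,000 online prediction requests per minute is a client-side sliding-window throttle; this class is an illustrative sketch of mine, not part of the `google-genai` SDK:

```python
import time

class MinuteRateLimiter:
    """Sliding-window throttle to stay under a requests-per-minute quota."""

    def __init__(self, max_per_minute):
        self.max_per_minute = max_per_minute
        self.timestamps = []  # monotonic times of recent requests

    def acquire(self):
        now = time.monotonic()
        # keep only the requests issued in the last 60 seconds
        self.timestamps = [t for t in self.timestamps if now - t < 60]
        if len(self.timestamps) >= self.max_per_minute:
            # sleep until the oldest request ages out of the window
            time.sleep(60 - (now - self.timestamps[0]))
        self.timestamps.append(time.monotonic())

# Example: throttle to the published online-prediction quota.
limiter = MinuteRateLimiter(30000)
```

Calling `limiter.acquire()` before each `generate_content` request keeps the client within the window; note the actual model-specific token-per-minute limits may be lower than these project-level request quotas.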