
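Your Gemini Pro May Have Quietly Become Flash: Three Pitfalls of Silent Degradation on Vertex AI

Our enterprise AI assistant ran on "Gemini 3.1 Pro" for two weeks. Only when users reported that every number was fabricated did we discover that all requests had actually been served by Flash.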

What happened

We run an enterprise AI assistant (ERIKA) on GCP Cloud Run, with LiteLLM as the LLM gateway in front of the Gemini models on Vertex AI.

The architecture looks roughly like this:

User → Cloud Run (FastAPI) → LiteLLM → Vertex AI (Gemini)

The config:

LLM_MODEL = "gemini/gemini-3.1-pro-preview"
LLM_FALLBACK_MODELS = "gemini/gemini-2.5-pro-preview-05-06,gemini/gemini-2.5-flash"

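Looks reasonable, right? The primary is 3.1 Pro; if it fails, fall back to 2.5 Pro; if that fails too, Flash is the safety net.

In practice, only the weakest of the three, Flash, was usable.

Pitfall 1: Preview models expire silently

gemini-2.5-pro-preview-05-06 is a dated preview model name. Once it expires, Vertex AI sends no email, emits no deprecation warning, and flags nothing in red in the console.

It simply returns a 404.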

HTTP 404: Model not found

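The first candidate in your fallback chain is dead, and you have no idea.

Lesson: use stable model names. Use gemini-2.5-pro, not gemini-2.5-pro-preview-05-06. Google's stable names come with an explicit retirement date (typically a year out), while preview names can disappear at any time.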
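A minimal sketch of a config lint that flags preview names before they reach production. The regex and the function name `find_preview_models` are assumptions made for illustration, based only on the naming pattern of the models in this post; the check says nothing about whether a name has actually expired.

```python
import re

# Assumption: preview builds end in "-preview" or "-preview-MM-DD",
# which is the pattern of the model names used in this post.
_PREVIEW_NAME = re.compile(r"-preview(-\d{2}-\d{2})?$")


def find_preview_models(model_names):
    """Return the configured model names that look like preview builds."""
    return [name for name in model_names if _PREVIEW_NAME.search(name)]


chain = [
    "gemini/gemini-3.1-pro-preview",
    "gemini/gemini-2.5-pro-preview-05-06",
    "gemini/gemini-2.5-flash",
]
flagged = find_preview_models(chain)
for name in flagged:
    print(f"preview model in chain, may disappear without notice: {name}")
```

Running a check like this in CI at least makes the risk visible at review time; it does not replace probing the models themselves.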

Pitfall 2: Gemini 3.x is served only from the global endpoint

This is the one we spent the most time debugging.

Our Cloud Run service sets GCP_LOCATION=us-central1, so LiteLLM sends requests to:

https://us-central1-aiplatform.googleapis.com/v1/projects/{project}/locations/us-central1/publishers/google/models/gemini-3.1-pro-preview

But on Vertex AI, the Gemini 3.x family (3, 3 Pro, 3 Flash, 3.1 Pro, 3.1 Flash) is available only on the global endpoint, not on regional endpoints.

The correct URL is:

https://aiplatform.googleapis.com/v1/projects/{project}/locations/global/publishers/google/models/gemini-3.1-pro-preview

Hit the regional endpoint → 404 → fallback.

Google's documentation does say this, but it is buried in the "Model endpoint locations" page, and you won't see it unless you go looking.
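To make the difference between the two URLs concrete, here is a small sketch that builds both from the same inputs. The helper name `vertex_model_url` and the project ID `my-project` are placeholders for illustration; the URL shapes are the two shown above.

```python
def vertex_model_url(project: str, location: str, model: str) -> str:
    """Build the Vertex AI publisher-model URL for a regional or global location."""
    # The global endpoint has no region prefix on the hostname.
    if location == "global":
        host = "aiplatform.googleapis.com"
    else:
        host = f"{location}-aiplatform.googleapis.com"
    return (
        f"https://{host}/v1/projects/{project}/locations/{location}"
        f"/publishers/google/models/{model}"
    )


regional_url = vertex_model_url("my-project", "us-central1", "gemini-3.1-pro-preview")
global_url = vertex_model_url("my-project", "global", "gemini-3.1-pro-preview")
print(regional_url)
print(global_url)
```

Both the hostname and the `locations/...` path segment change, so patching only one of them still produces a wrong URL.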

Our fix:

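def _build_request_kwargs(self, model, **extra):  # other parameters elided in the original
    request_kwargs = dict(extra)
    if model.startswith("vertex_ai/"):
        # Gemini 3.x is served only from the global endpoint
        if "gemini-3" in model:
            request_kwargs["vertex_location"] = "global"
        else:
            # Everything else keeps the configured region
            request_kwargs["vertex_location"] = self.vertex_location
    return request_kwargs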

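Pitfall 3: LiteLLM falls back silently

This one is the most insidious.

LiteLLM's fallback mechanism is designed for availability: if the primary fails, it automatically tries the next model. That is useful for rate limits (429) and transient errors (5xx).

But when both your primary and your first fallback return 404, it falls all the way through to the last model that still works (Flash) and returns a normal 200 response.

At the application level, what you see is:

The request succeeded ✓
The response came back normally ✓
The user got an answer ✓

The only clue is in the DEBUG-level log: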

LiteLLM completion() model= gemini-3.1-pro-preview; provider = vertex_ai
...
Model=gemini-2.5-flash; cost=0.02140088

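The first line says it is calling 3.1 Pro; the last line says it actually used Flash. The fallback steps in between appear only in the DEBUG log.

Just like that, your production system is downgraded from Pro to Flash, and your monitoring sees nothing.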

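Bonus pitfall: Gemini 3.x + low temperature = degenerate output

After fixing the endpoint, we hit a fourth pitfall.

Our data-analysis path sets temperature=0.2 (we wanted stable, less creative output). But Gemini 3.x produces degenerate output at low temperature: specifically, endlessly repeated --- horizontal rules that fill the entire page.

LiteLLM itself emits a warning: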

Warning: Setting temperature < 1.0 for Gemini 3 models (gemini-3.1-pro-preview)
can cause infinite loops, degraded reasoning performance, and failure on complex tasks.
Strongly recommended to use temperature >= 1.0

But this warning is logged at INFO level, which is easy to overlook in production.

Our fix is an automatic clamp:

# Gemini 3.x requires temperature >= 1.0; other models keep the configured value
if "gemini-3" in model and temperature < 1.0:
    effective_temp = 1.0
else:
    effective_temp = temperature

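Symptoms users saw

Because of the pitfalls above (404 → fallback to Flash → low temperature), users saw:

Hallucinated numbers: every figure was a round number (1,200 / 1,500 / 500), and the figures were mutually consistent ($3,990 × 1,200 = $4,788,000). Flash's reasoning is not strong enough for data-analysis tasks, but it is very good at fabricating a set of plausible-looking numbers.
Markdown explosion: the whole page filled with repeated --- rules (hundreds of lines), because Flash plus low temperature sent token generation into a degenerate loop.
Front-end badge showing "Gemini 3.1 Pro": the front end read the model name from config, not from the model actually used. The label said one thing and the product was another.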

What we fixed

Back end

Problem	Fix
Gemini 3.x endpoint	Detect `gemini-3` → automatically use the `global` endpoint
Low-temperature degeneration	Detect `gemini-3` → clamp `temperature >= 1.0`
Expired preview model	Remove the preview model and use a stable name

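Front end

Three layers of Markdown defense (because you can never trust the format of LLM output):

Pre-process (text level): normalize all HR variants and collapse repeated ---
Hard cap: if a single message contains more than 5 --- rules, remove them all
DOM cleanup (after render): walk the DOM tree and remove HR clusters and empty <p> tags
Streaming goes through the full pipeline too: previously only the final render was cleaned up, and intermediate streaming states were not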
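The first two text-level layers can be sketched as follows. This is an illustration in Python, not the front-end code from the incident (which presumably runs in the browser); the function name `sanitize_rules` and the regex are assumptions. It treats the `---`, `***`, and `___` variants as equivalent rather than rewriting them, collapses runs of rules into one, and then applies the cap of 5 from the list above.

```python
import re

# A Markdown horizontal rule: three or more of the same character
# (-, *, or _), optionally separated by spaces, alone on a line.
_HR_LINE = re.compile(r"^ {0,3}([-*_])(?:[ \t]*\1){2,}[ \t]*$")


def sanitize_rules(text: str, max_rules: int = 5) -> str:
    """Collapse repeated horizontal rules; drop them all past the cap."""
    kept = []
    rule_count = 0
    prev_was_rule = False  # True while the last non-blank kept line is a rule
    for line in text.split("\n"):
        is_rule = bool(_HR_LINE.match(line))
        if is_rule and prev_was_rule:
            continue  # repeated rule: already represented by the one kept
        if is_rule:
            rule_count += 1
        kept.append(line)
        if line.strip():
            prev_was_rule = is_rule
    if rule_count > max_rules:
        # Hard cap: this many rules signals degenerate output, so remove all.
        kept = [line for line in kept if not _HR_LINE.match(line)]
    return "\n".join(kept)


collapsed = sanitize_rules("Total\n---\n---\n***\n___\nDone")
capped = sanitize_rules("\n---\n".join("abcdefg"))
```

Because the function is pure and works on a string, it can run on every streaming chunk's accumulated text as well as on the final message.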

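Checklist: you may be affected too

If you use Vertex AI + LiteLLM (or any LLM gateway), check the following:

[ ] Are your model names stable or preview? Have any preview names expired?
[ ] Do your Gemini 3.x requests go to the global endpoint or a regional one?
[ ] Does your fallback mechanism trigger alerts, or does it degrade silently?
[ ] Do you record the model actually used for each response (not the one in config)?
[ ] What temperature do you use for Gemini 3.x? Is it below 1.0?
[ ] Does your front end read the displayed model name from config or from the response?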
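Several of these checks can be automated with a startup (or scheduled) probe that calls every model in the chain directly, bypassing fallback, so a dead entry fails loudly instead of being masked. The sketch below takes the probe as a parameter; `check_chain` and `fake_probe` are hypothetical names, and `fake_probe` only simulates the state described in this post. A real probe would send a one-token request to the given model with fallbacks disabled.

```python
from typing import Callable, Dict, Iterable, Optional


def check_chain(
    models: Iterable[str], probe: Callable[[str], None]
) -> Dict[str, Optional[str]]:
    """Probe each model; map its name to None (healthy) or an error string."""
    results: Dict[str, Optional[str]] = {}
    for model in models:
        try:
            probe(model)
            results[model] = None
        except Exception as exc:  # any failure marks the model as broken
            results[model] = f"{type(exc).__name__}: {exc}"
    return results


def fake_probe(model: str) -> None:
    """Stand-in for a real request: only Flash responds, as in the incident."""
    if model != "gemini/gemini-2.5-flash":
        raise LookupError("HTTP 404: Model not found")


report = check_chain(
    [
        "gemini/gemini-3.1-pro-preview",
        "gemini/gemini-2.5-pro-preview-05-06",
        "gemini/gemini-2.5-flash",
    ],
    fake_probe,
)
broken = [model for model, error in report.items() if error is not None]
print(broken)
```

Failing the deploy, or paging someone, whenever `broken` is non-empty turns a silent downgrade into an explicit error.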

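Conclusion

The most dangerous failure mode in LLM infrastructure is not "it went down" but "it was downgraded and nobody noticed."

Right now your system may be running a much weaker model than you expect, producing answers that look plausible but are wrong, while your monitoring is all green.

Go check your logs.

Maki's field notes on running an enterprise AI assistant on Gemini 3.1 Pro (for real this time)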
