23 · Routing / Caching / Billing Engine Specification (Core)

1 · Configuration schema (declarative, validated with Zod)

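The mapping from logical models to candidate routes is the core configuration that drives engine behavior. It is stored in D1 (see 22) and can be synced from a versionable JSON file. Structure: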

// logical-models.config.json (illustrative; D1 is authoritative at runtime, the file exists for versioned sync)
{
  "cheap-default": {
    "tier": "cheap", "multiplier": 1.0, "cacheTtl": 3600,
    "routes": [
      { "channel": "ch_deepseek",   "model": "deepseek/deepseek-v3.2", "priority": 1, "weight": 70 },
      { "channel": "ch_openrouter", "model": "deepseek/deepseek-v3.2", "priority": 1, "weight": 30 },
      { "channel": "ch_groq",       "model": "llama-3.3-70b",          "priority": 2, "weight": 100 }
    ]
  },
  "free-fallback": {
    "tier": "free", "multiplier": 0, "cacheTtl": 86400,
    "routes": [
      { "channel": "ch_workers_ai", "model": "@cf/meta/llama-3.1-8b", "priority": 1, "weight": 100 },
      { "channel": "ch_openrouter", "model": "meta-llama/llama-3.3-70b:free", "priority": 2, "weight": 100 }
    ]
  },
  "smart": {
    "tier": "premium", "multiplier": 8.0, "cacheTtl": 0,
    "routes": [
      { "channel": "ch_openrouter", "model": "anthropic/claude-haiku-4.5", "priority": 1, "weight": 100 },
      { "channel": "ch_openrouter", "model": "google/gemini-2.5-flash",    "priority": 2, "weight": 100 }
    ]
  },
  "embed-default": {
    "tier": "embedding", "multiplier": 1.0, "cacheTtl": 604800,
    "routes": [ { "channel": "ch_workers_ai", "model": "@cf/baai/bge-m3", "priority": 1, "weight": 100 } ]
  }
}
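The constraints a schema validator would enforce on this config can be sketched as follows. This is a minimal, dependency-free illustration rather than the actual Zod schema: the type names, the `validateConfig` helper, and the specific rules (priority ≥ 1, weight > 0, at least one route) are assumptions inferred from the example above.

```typescript
// Hypothetical types mirroring the example config; field names are taken from the JSON above.
type Tier = "free" | "cheap" | "premium" | "flagship" | "embedding";

interface Route {
  channel: string;
  model: string;
  priority: number; // lower value is tried first
  weight: number;   // relative share within one priority group
}

interface LogicalModel {
  tier: Tier;
  multiplier: number;
  cacheTtl: number; // seconds; 0 disables the response cache
  routes: Route[];
}

// Returns human-readable problems; an empty list means the config passed.
// The rules below are assumptions, not confirmed by the spec.
function validateConfig(config: Record<string, LogicalModel>): string[] {
  const errors: string[] = [];
  for (const name of Object.keys(config)) {
    const lm = config[name];
    if (lm.multiplier < 0) errors.push(`${name}: multiplier must be >= 0`);
    if (!Number.isInteger(lm.cacheTtl) || lm.cacheTtl < 0) {
      errors.push(`${name}: cacheTtl must be a non-negative integer`);
    }
    if (lm.routes.length === 0) errors.push(`${name}: at least one route is required`);
    lm.routes.forEach((r, i) => {
      if (!Number.isInteger(r.priority) || r.priority < 1) {
        errors.push(`${name}.routes[${i}]: priority must be an integer >= 1`);
      }
      if (r.weight <= 0) errors.push(`${name}.routes[${i}]: weight must be > 0`);
    });
  }
  return errors;
}
```

Running a check like this before syncing the JSON file into D1 would keep a malformed route from reaching the engine.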

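🎯 Design notes

The logical model name is the only thing a product is aware of. A product requests cheap-default, and operations can change at any time which real models/channels, weights, and cache settings sit behind it, without touching product code. This decouples model selection from the product and moves it into operations configuration.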

2 · Routing and fallback algorithm

function resolveCandidates(logicalModel):
  routes = D1.routes(logicalModel).where(enabled).orderBy(priority asc)
  groups = groupBy(routes, r => r.priority)          // one group per priority level
  ordered = []
  for g in groups by priority asc:
    ordered += weightedShuffle(g, by=weight)         // weighted random shuffle within a group
  return ordered                                      // ordered candidates: [preferred .. last resort]

async function dispatch(req, logicalModel):
  candidates = resolveCandidates(logicalModel)
  if candidates.empty: return 503 no_available_channel
  lastErr = null
  for (i, cand) in candidates:                        // try in order, falling back on failure
    try:
      resp = await providerAdapter(cand.channel).call(req, cand.model)   // via AI Gateway
      mark(resp, channel=cand, fallback=(i>0))
      return resp
    catch e where e.status in [429, 500, 502, 503, 504] or e.timeout:
      lastErr = e; continue                           // retryable error: move to the next candidate
    catch e:                                          // other 4xx (not 429): no fallback, the request itself is at fault
      return e
  return 502 upstream_error (lastErr)
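The candidate ordering and the retryable-error test above can be sketched in TypeScript. This is one possible reading of `weightedShuffle` (weighted sampling without replacement), not a confirmed implementation; the `enabled` field, the injectable `rng` parameter, and the error shape passed to `isRetryable` are assumptions added for illustration and testing.

```typescript
interface Route {
  channel: string;
  model: string;
  priority: number;
  weight: number;
  enabled: boolean; // assumed column; the pseudocode only says where(enabled)
}

// Weighted shuffle without replacement: repeatedly draw one route with
// probability proportional to its weight, then remove it from the pool.
function weightedShuffle(group: Route[], rng: () => number): Route[] {
  const pool = [...group];
  const out: Route[] = [];
  while (pool.length > 0) {
    const total = pool.reduce((sum, r) => sum + r.weight, 0);
    let pick = rng() * total;
    let idx = 0;
    while (idx < pool.length - 1 && pick >= pool[idx].weight) {
      pick -= pool[idx].weight;
      idx += 1;
    }
    out.push(pool.splice(idx, 1)[0]);
  }
  return out;
}

// Enabled routes, grouped by ascending priority, shuffled by weight within each group.
function resolveCandidates(routes: Route[], rng: () => number = Math.random): Route[] {
  const sorted = routes.filter((r) => r.enabled).sort((a, b) => a.priority - b.priority);
  const ordered: Route[] = [];
  let group: Route[] = [];
  for (const r of sorted) {
    if (group.length > 0 && group[0].priority !== r.priority) {
      ordered.push(...weightedShuffle(group, rng));
      group = [];
    }
    group.push(r);
  }
  ordered.push(...weightedShuffle(group, rng));
  return ordered;
}

// Statuses that trigger fallback to the next candidate, per the pseudocode above.
const RETRYABLE_STATUS = [429, 500, 502, 503, 504];

function isRetryable(err: { status?: number; timeout?: boolean }): boolean {
  if (err.timeout === true) return true;
  return err.status !== undefined && RETRYABLE_STATUS.indexOf(err.status) !== -1;
}
```

With the cheap-default routes, the 70/30 weights decide which priority-1 channel is tried first, while ch_groq always comes last because it sits alone in the priority-2 group.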

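Intent escalation (optional): a product may set metadata.intent to complex, in which case the engine switches the logical model to a higher tier according to a mapping (for example cheap-default→smart). By default there is no escalation, which is the conservative, cost-saving choice.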
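A minimal sketch of the escalation lookup, assuming the mapping is a simple table keyed by logical model name. The `ESCALATION` table and the `applyIntent` function are hypothetical names; only the cheap-default→smart pair comes from the text above.

```typescript
// Hypothetical escalation table; only this one pair is given in the spec.
const ESCALATION: Record<string, string> = {
  "cheap-default": "smart",
};

interface RequestMetadata {
  intent?: string;
}

// Returns the logical model to route with. Anything other than
// intent === "complex" leaves the requested model unchanged.
function applyIntent(logicalModel: string, metadata?: RequestMetadata): string {
  if (metadata?.intent !== "complex") return logicalModel;
  return ESCALATION[logicalModel] ?? logicalModel;
}
```

Models without an entry in the table are left as requested, so escalation never fails a request.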

3 · Caching strategy

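Response cache (KV): enabled only when the logical model has cacheTtl > 0, the request has stream=false, and temperature ≤ 0.2 (near-deterministic output).
cacheKey = sha256(logicalModel + canonical(messages) + max_tokens + temperature); canonicalization strips insignificant whitespace and normalizes field order.
Hit: return the cached response directly with x-gw-cache: hit, record cost=0, cache_hit=1; the request still counts toward the usage request count.
Prompt cache passthrough: for upstreams that support it (explicit for Anthropic; implicit for OpenAI/DeepSeek/Gemini), pass cache controls through; hits are priced by the upstream (roughly 0.25× the input price; the exact discount varies by provider).
Invalidation: entries expire naturally via TTL; on configuration changes, purge by logicalModel prefix (Cron or manual).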

// Cacheability check
cacheable = lm.cacheTtl > 0 && req.stream === false && (req.temperature ?? 1) <= 0.2
// The key carries the logical-model id as a plain prefix so entries can be purged per model
key = "cache:" + lm.id + ":" + sha256(lm.id + canonical(req.messages) + req.max_tokens + req.temperature)
hit = cacheable ? await KV.get(key) : null
if hit: return withHeaders(hit, {"x-gw-cache":"hit"})
// ...after calling the upstream:
if cacheable: await KV.put(key, resp, { expirationTtl: lm.cacheTtl })
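The cacheability rule and the canonicalization step can be sketched as below. The request shape, the decision to trim only leading and trailing whitespace, and the JSON-array encoding of the key material are assumptions; the spec does not define exactly which whitespace is insignificant. Hashing is left out so the block stays dependency-free: `cacheKeyMaterial` returns the string that would be passed to sha256.

```typescript
interface ChatMessage {
  role: string;
  content: string;
}

interface ChatRequest {
  messages: ChatMessage[];
  max_tokens?: number;
  temperature?: number;
  stream?: boolean;
}

// Mirrors the rule above: TTL configured, non-streaming, low temperature.
// A missing temperature is treated as 1, so it is not cacheable.
function isCacheable(cacheTtl: number, req: ChatRequest): boolean {
  return cacheTtl > 0 && req.stream === false && (req.temperature ?? 1) <= 0.2;
}

// Canonical form: fixed field order (role, then content) and trimmed content.
// Trimming only the ends is an assumption; inner whitespace may be meaningful.
function canonicalMessages(messages: ChatMessage[]): string {
  return JSON.stringify(messages.map((m) => [m.role, m.content.trim()]));
}

// String to feed into sha256. A JSON array keeps field boundaries unambiguous,
// which plain string concatenation would not.
function cacheKeyMaterial(logicalModel: string, req: ChatRequest): string {
  return JSON.stringify([
    logicalModel,
    canonicalMessages(req.messages),
    req.max_tokens ?? null,
    req.temperature ?? null,
  ]);
}
```

Two requests that differ only in JSON field order or surrounding whitespace then produce the same key material, while a different logical model always produces a different one.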

4 · Billing and quotas

Cost and quota-unit formulas

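// Actual upstream cost (USD); in_price and out_price are per 1M tokens
cost_usd = in_tokens/1e6 * model.in_price + out_tokens/1e6 * model.out_price
// Cache hit: cost_usd = 0
// Convert to "quota units" (used for internal quotas and for future billing)
billed_units = cost_usd * logicalModel.multiplier
// Write one row to requests (see 22) and deduct from quota
quota.used_units += billed_units   // atomic update; applied to both the day and month periods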
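A worked instance of the formulas above, as a small sketch. The prices used in the usage note are made-up round numbers for illustration, not real vendor prices; the function names are hypothetical.

```typescript
interface ModelPrice {
  in_price: number;  // USD per 1M input tokens
  out_price: number; // USD per 1M output tokens
}

// Upstream cost in USD; a response-cache hit costs nothing upstream.
function costUsd(inTokens: number, outTokens: number, price: ModelPrice, cacheHit: boolean): number {
  if (cacheHit) return 0;
  return (inTokens / 1e6) * price.in_price + (outTokens / 1e6) * price.out_price;
}

// Quota units charged against the key: cost scaled by the logical model's multiplier.
function billedUnits(cost: number, multiplier: number): number {
  return cost * multiplier;
}
```

For example, 2,000 input tokens and 500 output tokens at hypothetical prices of $0.25 and $1.00 per million tokens cost $0.0005 + $0.0005 = $0.001. With the premium multiplier of 8 that is 0.008 units, and with the free-tier multiplier of 0 it is 0 units regardless of cost.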

Quota check (before the request)

q = D1.quota(apiKey, 'day')
if q && q.used_units >= q.limit_units: return 429 quota_exceeded (Retry-After: secondsTo(q.reset_at))
// Once past the check, handle the request normally; afterwards deduct the actual billed_units (admit first, settle later, to avoid blocking)
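The admit-first, settle-later flow can be sketched as follows. The `Quota` shape, the use of epoch seconds for `reset_at`, and the in-memory `settle` function are assumptions; the real deduction would be an atomic D1 update.

```typescript
interface Quota {
  used_units: number;
  limit_units: number;
  reset_at: number; // epoch seconds (assumed unit)
}

type QuotaDecision =
  | { allowed: true }
  | { allowed: false; status: 429; code: "quota_exceeded"; retryAfter: number };

// Pre-request check: reject only if the quota is already exhausted.
// A key with no quota row (null) is treated as unlimited, matching `if q && ...`.
function checkQuota(q: Quota | null, nowSec: number): QuotaDecision {
  if (q !== null && q.used_units >= q.limit_units) {
    const retryAfter = Math.max(0, Math.ceil(q.reset_at - nowSec));
    return { allowed: false, status: 429, code: "quota_exceeded", retryAfter };
  }
  return { allowed: true };
}

// Post-request settlement with the actual billed units.
function settle(q: Quota, billedUnits: number): void {
  q.used_units += billedUnits;
}
```

Because settlement happens after the response, the last admitted request can push used_units past limit_units; the overshoot is bounded by the cost of the requests in flight at that moment.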

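Multipliers follow 06: free tier 0, cheap tier 1×, premium 8×, flagship 20×. Billing records use the usage figures returned by the vendor; reconcile monthly against vendor invoices (see the risks in 19).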

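5 · Token-bucket rate limiting (Durable Object)

Each api_key maps to one Durable Object instance, which guarantees strongly consistent counting (KV is eventually consistent and therefore unsuitable for precise rate limiting).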

// DO: RateLimiter (one per key)
state: { tokens: number, lastRefill: number }
const RATE = key.rpm_limit / 60     // tokens per second
const BURST = key.rpm_limit         // bucket capacity

function take():                    // called on every request
  now = Date.now()/1000
  tokens = min(BURST, tokens + (now - lastRefill) * RATE)
  lastRefill = now
  if tokens >= 1: tokens -= 1; return { ok: true }
  else: return { ok: false, retryAfter: ceil((1 - tokens)/RATE) }
// Concurrency cap: the DO keeps an inFlight counter and rejects once it exceeds concurrency

When the limit is exceeded, return 429 rate_limited with Retry-After and X-RateLimit-Remaining.
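The bucket logic can be sketched as a plain class, independent of the Durable Object runtime. Assumptions: the bucket starts full, the clock is passed in as seconds so the behavior is testable, and `remaining` (a candidate value for X-RateLimit-Remaining) is the floor of the tokens left. In a real Durable Object the two state fields would live in the object's storage.

```typescript
type TakeResult =
  | { ok: true; remaining: number }
  | { ok: false; retryAfter: number };

class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly rate: number;  // tokens per second
  private readonly burst: number; // bucket capacity

  constructor(rpmLimit: number, nowSec: number) {
    this.rate = rpmLimit / 60;
    this.burst = rpmLimit;
    this.tokens = this.burst; // assumption: a new key starts with a full bucket
    this.lastRefill = nowSec;
  }

  // Refill according to elapsed time, then try to consume one token.
  take(nowSec: number): TakeResult {
    const elapsed = nowSec - this.lastRefill;
    this.tokens = Math.min(this.burst, this.tokens + elapsed * this.rate);
    this.lastRefill = nowSec;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return { ok: true, remaining: Math.floor(this.tokens) };
    }
    // Seconds until one whole token is available again.
    return { ok: false, retryAfter: Math.ceil((1 - this.tokens) / this.rate) };
  }
}
```

With rpm_limit = 60 the bucket refills one token per second: a burst of 60 requests in the same instant succeeds, the 61st is rejected with retryAfter 1, and one second later a single request succeeds again.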

6 · Batch asynchronous flow (M3)

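POST /v1/batch { logical_model, items:[{custom_id, messages}], webhook? }
  │ validate + write batch record (D1) + split items into the Queue
  ▼
Cloudflare Queue ──► batch Worker (consumer)
  │ route by logical_model to an upstream that supports Batch pricing (DeepSeek/Gemini Batch)
  │ or run items one by one on a cheap model during off-peak hours; write results to D1/R2
  ▼
GET /v1/batch/:id → { status: processing|completed, results_url } (or webhook callback)

Billing: recorded from actual usage; the Batch price (about 50% off) shows up directly as a lower cost_usd.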

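The MVP (M1/M2) first implements the interface contract plus a single-item synchronous fallback; M3 adds true asynchronous processing via Queues and vendor Batch pricing.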
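The interface contract for the batch flow might be typed roughly as below. This is a sketch under assumptions: the request fields come from the POST body in the flow, while `QueueMessage`, `toQueueMessages`, and the completed-count rule in `batchStatus` are hypothetical.

```typescript
interface BatchItem {
  custom_id: string;
  messages: { role: string; content: string }[];
}

interface BatchRequest {
  logical_model: string;
  items: BatchItem[];
  webhook?: string;
}

// Hypothetical payload for one queue entry: one message per batch item.
interface QueueMessage {
  batch_id: string;
  logical_model: string;
  item: BatchItem;
}

// Split a validated batch request into per-item queue messages.
function toQueueMessages(batchId: string, req: BatchRequest): QueueMessage[] {
  return req.items.map((item) => ({
    batch_id: batchId,
    logical_model: req.logical_model,
    item,
  }));
}

// Status reported by GET /v1/batch/:id, derived from how many items have finished.
function batchStatus(totalItems: number, finishedItems: number): "processing" | "completed" {
  return finishedItems >= totalItems ? "completed" : "processing";
}
```

Keeping custom_id on every queue message lets the consumer write results that callers can match back to their own items.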