Intro
After migrating to Claude, the bot was doing well at Q&A. Then someone asked: "Can we also recommend similar past cases when someone's planning a new project?"
The idea: our team has an archive of ~600 past campaign pages in a headless CMS, tagged with categories, tech used, design approach, etc. When someone submits a pre-proposal consultation, the bot should find similar past cases and recommend them.
Simple enough. Then came the architecture debate.
The "Separate Bot" Debate
A colleague suggested building this as a completely separate bot. The reasoning made sense on paper:
- Different purpose (Q&A vs. case search)
- Different data source (knowledge markdown vs. CMS archive)
- Different prompt
- Mixing them would cause prompt bloat, cross-contamination, and coupled failure modes
Those are legitimate concerns. But looking at the existing codebase, the bot already handles three distinct flows (channel mentions, DMs, workflow submissions) with different prompts and different data, all routed by a single message handler. The separation they were worried about already existed at the code level, just not at the infrastructure level.
The cost of a separate service was harder to justify: double the Cloud Run instances, double the GCS buckets, double the CI/CD pipelines, double the secrets management, and double the things to monitor. All for something that shares the same Slack app, same GCP project, and same authentication.
We landed on a middle ground: same service, same repo, but completely separate handler + prompt + data path. The features are isolated in code, just not in infrastructure. And if it causes problems down the line, splitting it out is always an option since it's easier to split later than to merge later.
Why Not FAISS?
My first instinct was to build a proper similarity search with FAISS or some vector database. Then I looked at the actual data:
- ~600 records
- Each record is mostly tags (category, tech stack, design style) with a title and URL
- Total data size: ~400KB of JSON
- Claude's context window: 200K tokens
400KB of structured tag data fits comfortably in Claude's context. And the data is tag-based, not free text - embedding models aren't great at computing meaningful distances between categorical tags like ["sale", "seasonal"] vs ["new-store-opening"].
Claude, on the other hand, understands the semantic relationship between these tags natively. It can reason about "the user wants a seasonal sale page with animation" and match that against tags across 600 records better than cosine similarity on tag embeddings would.
The approach: Pass all 600 records directly in the system prompt. One API call, no vector database, no embedding model, no index management. If the archive grows to thousands of records, we can revisit. For now, this is simpler and works better.
Making It Token-Efficient
Passing raw JSON would waste tokens. Instead, I format each record as a compact one-liner:
```python
# Fields to include in context: (json_key, display_label)
CONTEXT_FIELDS = [
    ("category", "cat"),
    ("tags", "tags"),
    ("date", "date"),
    ("url", "url"),
]


def get_context_text(self) -> str:
    lines = [f"[Case Archive: {len(self.data)} total]"]

    for i, case in enumerate(self.data, 1):
        parts = [f"{i}. {case.get('title', 'N/A')}"]

        for key, label in CONTEXT_FIELDS:
            value = case.get(key)
            if not value:
                continue
            if isinstance(value, list):
                parts.append(f"{label}:{', '.join(value)}")
            elif key == "date":
                parts.append(f"{label}:{value[:10]}")
            else:
                parts.append(f"{label}:{value}")

        lines.append(" | ".join(parts))

    return "\n".join(lines)
```
Output looks like:
```
[Case Archive: 600 total]
1. Holiday Campaign 2026 | cat:seasonal | tech:scroll-animation | design:parallax | date:2026-03-01 | url:https://...
2. Product Launch Page | cat:launch | tech:vanilla-js | date:2026-02-15 | url:https://...
```
Each record is one line, ~100-200 tokens. 600 records = ~60K-120K tokens in the system prompt. Well within limits, and since the data only changes weekly, Anthropic's prompt caching kicks in across requests.
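Since the archive only changes weekly, it can live in its own system block with a cache breakpoint so repeat requests reuse the cached prefix instead of re-processing tens of thousands of tokens. A minimal sketch of that shape - `build_system_blocks` is my name for it, not the actual code; it uses the Anthropic Messages API's block-level `cache_control`:

```python
def build_system_blocks(instructions: str, archive_text: str) -> list[dict]:
    """Split the system prompt into a small dynamic block and a large
    static block; the cache breakpoint sits after the archive."""
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": archive_text,
            # Everything up to and including this block is cached across requests
            "cache_control": {"type": "ephemeral"},
        },
    ]
```

The returned list is passed as the `system=` argument to `client.messages.create(...)`; keeping the instructions in a separate block means you can tweak them without invalidating the cached archive prefix (the cache covers everything up to the marked block).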
The Data Pipeline
The case archive lives in a headless CMS with a GraphQL API. I needed to:
- Fetch the data weekly
- Transform it to a clean JSON format
- Upload to GCS (same bucket as the knowledge base)
The Fetch Script
A standalone Python script using only stdlib (no app dependencies, since it runs in CI):
```python
#!/usr/bin/env python3

import json
import sys
import urllib.request
from pathlib import Path

API_URL = "https://cms.example.com/api/cases"


def fetch_cases() -> list[dict]:
    req = urllib.request.Request(API_URL)
    with urllib.request.urlopen(req, timeout=30) as response:
        raw = json.loads(response.read().decode("utf-8"))

    items = (raw.get("data") or {}).get("cases") or []

    # Filter out entries without a title, sort newest first
    result = [item for item in items if item.get("title")]
    result.sort(key=lambda x: x.get("date", ""), reverse=True)
    return result


def main() -> None:
    output_path = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("archive.json")

    data = fetch_cases()

    if not data:
        print("Error: No data received", file=sys.stderr)
        sys.exit(1)

    output_path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2) + "\n",
        encoding="utf-8",
    )
    print(f"Done: {len(data)} cases written to {output_path}")


if __name__ == "__main__":
    main()
```
Key design decisions:
- Stdlib only (`urllib`, `json`) - no `requests` dependency. This runs in a bare CI environment, and adding `requests` just for one HTTP call isn't worth it.
- Output path as argument - the script writes to wherever you tell it, so CI can write to a temp file before uploading.
- Null-safe chaining - the CMS GraphQL response can have `None` at any level, so `(x or {}).get(...)` everywhere.
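The null-safe idiom is worth a second look, because the naive chained `.get()` raises `AttributeError` the moment an intermediate value is `None` rather than a missing key. A tiny illustration (`extract_cases` is a hypothetical helper mirroring the script's logic, not part of it):

```python
def extract_cases(raw: dict) -> list:
    # (x or {}) turns an explicit None into an empty dict, so the chained
    # .get() is safe whether the key is missing OR present-but-null
    return (raw.get("data") or {}).get("cases") or []
```

Compare: `raw.get("data", {}).get("cases")` would still blow up on `{"data": None}`, because the default only applies when the key is absent.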
The GitHub Actions Workflow
```yaml
name: Deploy Case Archive Data (CMS → GCS)

on:
  schedule:
    - cron: '0 1 * * 1'  # Monday 10:00 JST
  workflow_dispatch:

env:
  GCS_BUCKET: your-bot-memory
  GCS_PATH: data/case_archive.json

jobs:
  fetch-and-upload:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write

    steps:
      - uses: actions/checkout@v6

      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'

      - uses: google-github-actions/auth@v3
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.WIF_SERVICE_ACCOUNT }}
          token_format: access_token

      - uses: google-github-actions/setup-gcloud@v3

      - name: Fetch data from CMS
        run: python scripts/fetch_archive.py archive.json

      - name: Upload to GCS
        run: gsutil cp archive.json gs://${{ env.GCS_BUCKET }}/${{ env.GCS_PATH }}
```
No PR, no JSON committed to the repo, no merge step. The script fetches from the CMS and uploads straight to GCS. If you already have Pub/Sub auto-reload set up, the bot picks up the new data automatically.
How I Got Here: Three Approaches to Data Storage
This wasn't the first design. I went through three iterations before landing on direct GCS upload.
Attempt 1: Commit JSON to the repo, bake into Docker image
The initial approach. The weekly GHA would fetch from the CMS, commit the JSON to the repo, create a PR for review, and after merge the deploy workflow would build a new Docker image with the data baked in.
This worked, but it felt wrong immediately:
- 400KB JSON file polluting git history, growing every week
- Every data update triggers a full Docker rebuild + Cloud Run redeploy
- Weekly PRs with 600-line JSON diffs that nobody actually reviews
- The `Dockerfile` needed a `COPY data/ ./data/` line just for one file
Attempt 2: Commit to repo, upload to GCS on deploy
Next I tried a hybrid: keep the JSON in the repo for reviewability, but read it from GCS at runtime instead of from the filesystem. The deploy workflow would gsutil cp the JSON to GCS alongside the Docker deploy.
Better - no more baking data into the image. But it still had the git history bloat and the weekly noise PRs. And the deploy workflow was doing two unrelated things (deploy app + upload data) which meant a data-only change still triggered a full deploy.
Attempt 3 (final): Direct CMS → GCS, no repo involvement
The realization: nobody was actually reviewing those JSON diffs. 600 records of CMS-generated tag data isn't something you eyeball for correctness. If the CMS data is wrong, you'd catch it from the bot's responses, not from a PR diff.
So I cut the repo out entirely. The GHA fetches from the CMS and uploads straight to GCS. The JSON never touches git. The app reads it from GCS at runtime via the same MemoryService that loads knowledge files.
Benefits:
- Zero git history bloat
- No weekly noise PRs
- Data updates are decoupled from app deploys
- The existing `OBJECT_FINALIZE` → Pub/Sub → `/admin/reload-memory` pipeline handles cache invalidation automatically
- If you need to debug the data, `gsutil cat gs://your-bucket/data/archive.json` is right there
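For context, the reload trigger at the end of that pipeline can be a small predicate. This is a hedged sketch under my own naming (`should_reload`, `ARCHIVE_OBJECT`), not the actual handler: GCS notifications delivered via Pub/Sub push carry the event type and object path in the message attributes, and only a finalize event on the archive file should cause a reload.

```python
ARCHIVE_OBJECT = "data/case_archive.json"


def should_reload(push_body: dict) -> bool:
    """Decide whether a Pub/Sub push payload warrants reloading the archive.

    GCS notification attributes include eventType (e.g. OBJECT_FINALIZE)
    and objectId (the object path within the bucket).
    """
    attrs = (push_body.get("message") or {}).get("attributes") or {}
    return (
        attrs.get("eventType") == "OBJECT_FINALIZE"
        and attrs.get("objectId") == ARCHIVE_OBJECT
    )
```

Filtering on `objectId` matters because the bucket is shared with the knowledge base: a knowledge-file upload fires the same notification, and you don't want it re-parsing the case archive.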
Routing by Workflow
The existing bot routes messages by channel + workflow ID. Adding the new feature was the same pattern:
```python
# app/handlers/router.py

@app.event("message")
def _message(event, say, client, logger):
    bot_id = event.get("bot_id")
    channel = event.get("channel")

    # Route 1: Knowledge submission workflow
    if bot_id and channel == KNOWLEDGE_CHANNEL_ID:
        ...  # dispatch to the knowledge handler, then return

    # Route 2: Case consultation workflow
    if bot_id and channel == CONSULTATION_CHANNEL_ID:
        metadata = event.get("metadata", {})
        workflow_id = metadata.get("event_payload", {}).get("workflow_id")

        # Ignore bot messages in this channel not posted by our workflow
        if CONSULTATION_WORKFLOW_ID and workflow_id != CONSULTATION_WORKFLOW_ID:
            return

        # Slack retries deliveries; drop events we've already handled
        if _is_duplicate(f"consultation:{event.get('event_ts')}"):
            return

        handle_consultation(event, say, client, logger)
        return

    # Ignore all other bot messages
    if bot_id:
        return

    # Route 3: DMs
    if event.get("channel_type") == "im":
        handle_dm(event, say, logger)
```
Each route checks channel + workflow ID, deduplicates, and delegates to a feature handler. The router doesn't know or care what each handler does.
Parsing Workflow Form Messages
Slack Workflow forms arrive as plain text with field labels. I parse out just the consultation content:
```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsultationFormData:
    consultation: str
    reference_url: Optional[str] = None


def parse_consultation_message(text: str) -> ConsultationFormData:
    consultation_lines = []
    reference_url = None
    current_field = None

    for line in text.strip().split("\n"):
        if "Reference URL" in line:
            current_field = "url"
        elif "Consultation" in line:
            current_field = "consultation"
        elif current_field == "url" and not reference_url:
            url_match = re.search(r"https?://[^\s<>]+", line)
            if url_match:
                reference_url = url_match.group(0)
        elif current_field == "consultation":
            consultation_lines.append(line)

    consultation = "\n".join(consultation_lines).strip()

    # Fall back to full text if parsing fails
    if not consultation:
        consultation = text.strip()

    return ConsultationFormData(
        consultation=consultation,
        reference_url=reference_url,
    )
```
The fallback is important - if the message format changes, the bot degrades to passing the full text to Claude rather than breaking entirely.
The Refactoring That Came After
Adding this feature exposed a structural problem. The original codebase had:
- `slack_events.py` (290 lines) - routing + Q&A logic + feedback handlers + help command
- `claude.py` (250 lines) - Q&A generation + document generation + case consultation, all in one class
Every feature was mixed into two god files. Adding the consultation feature made it obvious.
Splitting the Services
The AI service was doing three unrelated things with one shared Anthropic client. I split it:
```
Before:
  claude.py (ClaudeService - does everything)

After:
  claude_client.py        → Shared Anthropic client + send_message()
  qa_service.py           → Q&A: system prompt + memory + chat history
  knowledge_service.py    → Document generation: DCP prompt + GCS
  consultation_service.py → Case search: archive prompt + CMS data
```
Each service owns its own prompt, data loading, and error handling. They share only the Anthropic client.
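In outline, the pattern looks something like this - a minimal sketch under assumed names and signatures, not the actual code. The underlying client is injected, which is also what makes the services trivially testable with a stub:

```python
class ClaudeClient:
    """Thin shared wrapper: owns the API client, knows nothing about features."""

    def __init__(self, client):
        self._client = client  # an anthropic.Anthropic instance in production

    def send_message(self, system, user_text, model="claude-sonnet-4-5"):
        response = self._client.messages.create(
            model=model,
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": user_text}],
        )
        return response.content[0].text


class ConsultationService:
    """Owns the archive prompt and data; delegates the call to the shared client."""

    def __init__(self, claude_client, archive_text):
        self._claude = claude_client
        self._archive_text = archive_text

    def find_similar_cases(self, consultation: str) -> str:
        system = f"Recommend similar past campaign cases.\n\n{self._archive_text}"
        return self._claude.send_message(system, consultation)
```

`qa_service.py` and `knowledge_service.py` follow the same shape with their own prompts and data sources; only `ClaudeClient` knows the Anthropic SDK exists.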
Splitting the Handlers
Same idea for the Slack handlers:
```
Before:
  slack_events.py (routing + Q&A + feedback + help, all inline)

After:
  router.py               → Routing + dedup only
  qa_handler.py           → Mention + DM handling
  feedback_handler.py     → Thumbs up/down
  knowledge_handler.py    → Knowledge submission (renamed from workflow_handler.py)
  consultation_handler.py → Case consultation
```
The router became a thin dispatch layer. Each handler file has one job.
Was It Worth It?
98 tests, zero behavior changes, same test count before and after. The refactoring was purely structural. But now when I look at the codebase, I can tell what each file does from its name. Adding a fifth feature would be: one new handler, one new service, one new route in the router.
Wrapping Up
The full pipeline:
- Weekly: GitHub Actions fetches from CMS → uploads JSON to GCS
- On GCS update: Pub/Sub notifies Cloud Run → bot reloads archive data
- On consultation: User submits Slack Workflow form → bot parses consultation text → Claude matches against 600 archived cases in context → bot responds with 3-5 similar cases + rationale
Key takeaways:
- Don't build a separate service when a route will do - Same Slack app, same GCP project, same auth. A new handler is cheaper than a new service in every dimension.
- FAISS is overkill for small, tag-based datasets - 600 records of categorical tags fit in Claude's context window. Let the LLM do the semantic matching instead of building a similarity search pipeline.
- Separate data pipelines from app deployments - CMS data changes weekly, app code changes irregularly. Don't couple them. GCS + Pub/Sub decouples data freshness from deploy cycles.
- God files are a smell, not a crisis - The single `claude.py` worked fine for two features. It got painful at three. Refactor when it hurts, not before.
- Prompt caching makes context-heavy approaches viable - Passing 600 records in every system prompt sounds expensive, but with prompt caching the static data is cached across requests. You only pay full price once per cache window.
