Intro
After adding similar case recommendations, the bot could answer team questions from markdown files in GCS and recommend past cases from CMS data. Two separate knowledge sources, both flowing into the same GCS bucket, both working well.
Then a problem surfaced: some of the team's knowledge lived in Confluence.
Not in the bot's GCS bucket. Not in markdown files anyone maintained. Just... in Confluence wiki pages, maintained by people who had no idea the bot existed. Asking them to also maintain a separate set of markdown files in GCS was never going to happen. And copy-pasting wiki content into markdown files every time someone updated a Confluence page was exactly the kind of manual process that rots over time.
The goal: Use Confluence as a knowledge source without duplicating content. Confluence stays the source of truth, GCS stays the bot's memory, and a sync script bridges the two.
What We're Building
Before: Knowledge exists in two places - GCS (bot reads this) and Confluence (humans maintain this). They don't talk to each other.
After: A weekly pipeline fetches Confluence pages, converts them to markdown, and uploads to GCS. Confluence remains the single source of truth. The bot picks up changes automatically.
Key features:
- Fetch only pages under a specific parent page (not the entire space)
- Convert Confluence HTML to clean markdown
- Filter out empty container pages
- Run locally for testing, run weekly in CI for production
Why Not Read Confluence Directly?
I considered three approaches:
| Approach | Pros | Cons |
|---|---|---|
| Sync to GCS | Zero bot code changes, Confluence is source of truth | Slight staleness (weekly sync) |
| Read at runtime | Always fresh | Runtime dependency on Confluence API, slower startup |
| Claude tool call | Scales better, on-demand | Complex, extra latency per query, prompt engineering |
Sync to GCS won because the bot's MemoryService already loads all .md files from the bucket. Drop markdown files under a new prefix, and the bot picks them up with zero code changes. The Confluence API only gets hit once a week in CI, not on every user question.
If the knowledge base grows large enough to blow past Claude's context window, that's when the tool-call approach becomes worth the complexity. We're not there yet.
The Confluence API
Authentication Setup
Confluence Cloud uses basic auth with an API token (not your account password):
- Go to https://id.atlassian.com/manage-profile/security/api-tokens
- Create a new token
- Your credentials are `email:token`, base64-encoded in the `Authorization` header
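The header construction can be sanity-checked with a few lines of stdlib Python. The email and token below are placeholders, and this helper is just an illustration of the credential format:

```python
import base64


def auth_header(email: str, token: str) -> str:
    # Confluence Cloud basic auth: base64("email:api_token")
    credentials = f"{email}:{token}"
    return "Basic " + base64.b64encode(credentials.encode()).decode()


print(auth_header("you@company.com", "your-api-token"))
```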
Fetching Pages Under a Parent
I didn't want every page in the space - that would be hundreds of unrelated pages. The Confluence REST API supports CQL (Confluence Query Language) with an ancestor filter that returns all descendants of a given page:
```
GET /rest/api/content/search?cql=ancestor=12345678 AND type=page&expand=body.storage
```
The parent page ID is the number in the Confluence URL: https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/12345678/Page+Title.
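If you'd rather not eyeball the URL, the ID can be pulled out programmatically. This helper is my own illustration, not part of the sync script:

```python
import re


def parent_page_id_from_url(url: str) -> str:
    # The page ID is the numeric path segment right after /pages/
    match = re.search(r"/pages/(\d+)", url)
    if not match:
        raise ValueError(f"no page ID in {url!r}")
    return match.group(1)


print(parent_page_id_from_url(
    "https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/12345678/Page+Title"
))  # 12345678
```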
The Sync Script
The script does one thing: fetch Confluence pages and write them as local markdown files. No GCS upload, no Secret Manager calls, no framework dependencies beyond markdownify for HTML-to-markdown conversion.
```python
#!/usr/bin/env python3

import argparse
import base64
import json
import os
import shutil
import sys
import urllib.parse
import urllib.request
from pathlib import Path

from markdownify import markdownify

CONFLUENCE_URL = os.getenv("CONFLUENCE_URL", "")
CONFLUENCE_EMAIL = os.getenv("CONFLUENCE_EMAIL", "")
CONFLUENCE_API_TOKEN = os.getenv("CONFLUENCE_API_TOKEN", "")
CONFLUENCE_PARENT_PAGE_ID = os.getenv("CONFLUENCE_PARENT_PAGE_ID", "")

PAGE_LIMIT = 25
MIN_CONTENT_LENGTH = 50  # Skip container pages


def _auth_header() -> str:
    credentials = f"{CONFLUENCE_EMAIL}:{CONFLUENCE_API_TOKEN}"
    return f"Basic {base64.b64encode(credentials.encode()).decode()}"


def fetch_descendant_pages() -> list[dict]:
    base = CONFLUENCE_URL.rstrip("/")
    cql = f"ancestor={CONFLUENCE_PARENT_PAGE_ID} AND type=page"
    start = 0
    all_pages = []

    while True:
        url = (
            f"{base}/rest/api/content/search"
            f"?cql={urllib.parse.quote(cql)}"
            f"&expand=body.storage"
            f"&limit={PAGE_LIMIT}"
            f"&start={start}"
        )
        req = urllib.request.Request(url, headers={
            "Authorization": _auth_header(),
            "Accept": "application/json",
        })
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = json.loads(resp.read().decode("utf-8"))

        results = data.get("results", [])
        all_pages.extend(results)

        if len(results) < PAGE_LIMIT:
            break
        start += PAGE_LIMIT

    return all_pages
```
Converting HTML to Markdown
Confluence stores page content as HTML (the "storage" format). markdownify handles the conversion:
```python
def page_to_markdown(page: dict) -> str:
    title = page.get("title", "Untitled")
    html_body = page.get("body", {}).get("storage", {}).get("value", "")
    md_body = markdownify(html_body, heading_style="ATX", strip=["img"])
    return f"# {title}\n\n{md_body}"
```
I strip images since they're not useful as LLM context - they'd just be broken image references in markdown.
Filtering Out Container Pages
Confluence spaces often have parent pages that exist purely for navigation - they contain nothing but a "Children Display" macro, which renders as a single word like "true" in the storage format. These add noise to the bot's context.
The fix: skip pages where the body content is shorter than 50 characters after conversion:
```python
for page in pages:
    title = page.get("title", "Untitled")
    md_content = page_to_markdown(page)

    # Skip container pages with no real content
    body_text = md_content.split("\n", 2)[-1].strip() if "\n" in md_content else ""
    if len(body_text) < MIN_CONTENT_LENGTH:
        skipped += 1
        print(f"  [skipped] {title} ({len(body_text)} chars)")
        continue

    filename = f"{sanitize_filename(title)}.md"
    (output_dir / filename).write_text(md_content, encoding="utf-8")
```
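The `sanitize_filename` helper isn't shown above. A minimal sketch of what such a helper might do (replace anything that isn't filesystem-safe, cap the length) could look like this; the exact rules are an assumption:

```python
import re


def sanitize_filename(title: str) -> str:
    # Replace path separators and other unsafe characters with underscores,
    # collapse whitespace, and cap the length for sane filenames
    safe = re.sub(r"[^\w\-. ]", "_", title)
    safe = re.sub(r"\s+", " ", safe).strip()
    return safe[:100] or "untitled"
```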
CLI Arguments with Env Var Fallbacks
The script accepts values as CLI arguments for local use, falling back to environment variables for CI:
```python
parser.add_argument("--url", default=os.getenv("CONFLUENCE_URL", ""))
parser.add_argument("--email", default=os.getenv("CONFLUENCE_EMAIL", ""))
parser.add_argument("--token", default=os.getenv("CONFLUENCE_API_TOKEN", ""))
parser.add_argument("--parent-page-id", default=os.getenv("CONFLUENCE_PARENT_PAGE_ID", ""))
parser.add_argument("-o", "--output", type=Path, default=Path("confluence_pages"))
```
Locally:
```bash
uv run python scripts/sync_confluence.py \
  --url https://your-domain.atlassian.net/wiki \
  --email you@company.com \
  --token your-api-token \
  --parent-page-id 12345678
```
In CI: environment variables are set by the workflow, so no arguments needed.
The GitHub Actions Pipeline
Same pattern as the CMS data pipeline: fetch data, write to local files, upload to GCS with gsutil. The script doesn't know about GCS, and the workflow doesn't know about Confluence's API.
The Confluence credentials live in GCP Secret Manager. The workflow fetches them with gcloud secrets versions access and passes them as environment variables to the script:
```yaml
name: Sync Confluence Pages (Confluence → GCS)

on:
  schedule:
    - cron: '0 1 * * 1'  # Monday 10:00 JST
  workflow_dispatch:

env:
  GCS_MEMORY_BUCKET: your-bot-memory

jobs:
  sync:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write

    steps:
      - uses: actions/checkout@v6
      - uses: astral-sh/setup-uv@v6
      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: uv sync

      - uses: google-github-actions/auth@v3
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.WIF_SERVICE_ACCOUNT }}
          token_format: access_token

      - uses: google-github-actions/setup-gcloud@v3

      - name: Fetch Confluence secrets
        id: secrets
        run: |
          echo "CONFLUENCE_URL=$(gcloud secrets versions access latest --secret=CONFLUENCE_URL)" >> "$GITHUB_OUTPUT"
          echo "CONFLUENCE_EMAIL=$(gcloud secrets versions access latest --secret=CONFLUENCE_EMAIL)" >> "$GITHUB_OUTPUT"
          echo "CONFLUENCE_API_TOKEN=$(gcloud secrets versions access latest --secret=CONFLUENCE_API_TOKEN)" >> "$GITHUB_OUTPUT"
          echo "CONFLUENCE_PARENT_PAGE_ID=$(gcloud secrets versions access latest --secret=CONFLUENCE_PARENT_PAGE_ID)" >> "$GITHUB_OUTPUT"

      - name: Fetch Confluence pages
        env:
          CONFLUENCE_URL: ${{ steps.secrets.outputs.CONFLUENCE_URL }}
          CONFLUENCE_EMAIL: ${{ steps.secrets.outputs.CONFLUENCE_EMAIL }}
          CONFLUENCE_API_TOKEN: ${{ steps.secrets.outputs.CONFLUENCE_API_TOKEN }}
          CONFLUENCE_PARENT_PAGE_ID: ${{ steps.secrets.outputs.CONFLUENCE_PARENT_PAGE_ID }}
        run: uv run python scripts/sync_confluence.py -o confluence_pages

      - name: Upload to GCS
        run: gsutil -m rsync -d confluence_pages gs://${{ env.GCS_MEMORY_BUCKET }}/aem
```
Why gsutil rsync -d?
The `-d` flag deletes files in GCS that don't exist locally. If a page was removed from Confluence, it gets removed from GCS too. `-m` runs transfers in parallel for speed.
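Because `-d` is destructive, it's worth previewing what a sync would delete before trusting it in CI. `gsutil rsync` has a dry-run flag for exactly this; the bucket name below is a placeholder:

```shell
# -n: dry run — print what rsync would copy/delete without doing it
gsutil -m rsync -d -n confluence_pages gs://your-bot-memory/aem
```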
Why uv sync in CI?
Unlike the CMS fetch script (which uses only stdlib), this script depends on markdownify for HTML-to-markdown conversion. So the workflow installs project dependencies with uv sync before running.
Secret Manager Setup
Four secrets, all straightforward:
1echo -n "https://your-domain.atlassian.net/wiki" | \2 gcloud secrets create CONFLUENCE_URL --data-file=-34echo -n "you@company.com" | \5 gcloud secrets create CONFLUENCE_EMAIL --data-file=-67echo -n "your-api-token" | \8 gcloud secrets create CONFLUENCE_API_TOKEN --data-file=-910echo -n "12345678" | \11 gcloud secrets create CONFLUENCE_PARENT_PAGE_ID --data-file=-
The service account running the GHA workflow needs roles/secretmanager.secretAccessor to read these.
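Granting the role per secret keeps access tightly scoped. The service-account and project names below are placeholders; repeat for each of the four secrets:

```shell
# Allow the CI service account to read this secret's payload
gcloud secrets add-iam-policy-binding CONFLUENCE_API_TOKEN \
  --member="serviceAccount:ci-sync@your-project.iam.gserviceaccount.com" \
  --role="roles/secretmanager.secretAccessor"
```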
How It Fits Together
```
┌──────────────┐     ┌─────────────┐     ┌──────────────────┐
│  Confluence  │     │   GitHub    │     │  GCS             │
│  wiki pages  │────▶│   Actions   │────▶│  /aem/*.md       │
│  (HTML)      │     │  (weekly)   │     │                  │
└──────────────┘     │             │     │  /faq.md         │
                     │  fetch →    │     │  /processes.md   │
                     │  convert →  │     │  /data/archive.  │
                     │  upload     │     │    json          │
                     └─────────────┘     └────────┬─────────┘
                                                  │
                                        ┌─────────▼──────────┐
                                        │  Cloud Run         │
                                        │                    │
                                        │  MemoryService     │
                                        │  loads ALL .md     │
                                        │  files from bucket │
                                        └────────────────────┘
```
The bot doesn't know or care that some markdown files came from Confluence. It loads everything from GCS, same as before.
Gotchas
1. Space Key vs. Parent Page ID
My first version fetched all pages in a Confluence space. That pulled hundreds of unrelated pages. The CQL ancestor filter is what you want - it scopes to descendants of a specific page.
2. Container Pages Are Noise
Pages that only contain a "Children Display" macro produce almost no markdown content - just a title and "true". Without the minimum content length filter, these end up as near-empty files in the bot's context. A 50-character threshold catches them while keeping short pages that have real content.
3. Confluence API Pagination
The v1 REST API returns at most 25 results per request by default. If you have more than 25 pages under your parent, you need to paginate with the `start` parameter. Easy to miss if you're testing with a small space.
Wrapping Up
The pattern keeps repeating: external data source → fetch script → local files → GCS → bot reads from bucket. CMS data, knowledge markdown, and now Confluence pages all flow through the same pipeline shape. The bot's MemoryService doesn't need to know where the data came from - it just loads .md files.
Key takeaways:
- Don't ask people to maintain two places - If knowledge already lives in Confluence, sync it. Asking humans to copy-paste between systems is a losing battle.
- Scripts should be dumb pipes - The sync script fetches and converts. The CI workflow handles secrets and uploads. Neither knows about the other's concerns.
- CLI args with env var fallbacks - Same script works locally (pass arguments) and in CI (set env vars). No code branching needed.
- Filter noise early - Container pages, navigation pages, and stub pages add nothing to LLM context. A simple length check saves context window for real content.
- GCS as a universal knowledge sink - Markdown from humans, JSON from a CMS, markdown from Confluence. Different sources, same destination, same bot interface.
