
Syncing Confluence to GCS: Bridging Two Knowledge Sources

Tags: Python · Confluence · GCS · GHA


Intro

After adding similar case recommendations, the bot could answer team questions from markdown files in GCS and recommend past cases from CMS data. Two separate knowledge sources, both flowing into the same GCS bucket, both working well.

Then a problem surfaced: some of the team's knowledge lived in Confluence.

Not in the bot's GCS bucket. Not in markdown files anyone maintained. Just... in Confluence wiki pages, maintained by people who had no idea the bot existed. Asking them to also maintain a separate set of markdown files in GCS was never going to happen. And copy-pasting wiki content into markdown files every time someone updated a Confluence page was exactly the kind of manual process that rots over time.

The goal: Use Confluence as a knowledge source without duplicating content. Confluence stays the source of truth, GCS stays the bot's memory, and a sync script bridges the two.

What We're Building

Before: Knowledge exists in two places - GCS (bot reads this) and Confluence (humans maintain this). They don't talk to each other.

After: A weekly pipeline fetches Confluence pages, converts them to markdown, and uploads to GCS. Confluence remains the single source of truth. The bot picks up changes automatically.


Why Not Read Confluence Directly?

I considered three approaches:

| Approach | Pros | Cons |
| --- | --- | --- |
| Sync to GCS | Zero bot code changes; Confluence is source of truth | Slight staleness (weekly sync) |
| Read at runtime | Always fresh | Runtime dependency on Confluence API; slower startup |
| Claude tool call | Scales better; on-demand | Complex; extra latency per query; prompt engineering |

Sync to GCS won because the bot's MemoryService already loads all .md files from the bucket. Drop markdown files under a new prefix, and the bot picks them up with zero code changes. The Confluence API only gets hit once a week in CI, not on every user question.

If the knowledge base grows large enough to blow past Claude's context window, that's when the tool-call approach becomes worth the complexity. We're not there yet.

The Confluence API

Authentication Setup

Confluence Cloud uses basic auth with an API token (not your account password):

  1. Go to https://id.atlassian.com/manage-profile/security/api-tokens
  2. Create a new token
  3. Your credentials are email:token, base64-encoded in the Authorization header
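Step 3 in concrete terms: the header value is just email:token base64-encoded and prefixed with "Basic". A stdlib-only sketch, with placeholder credentials:

```python
import base64

# Placeholder credentials -- substitute your real Atlassian email and API token.
email = "you@company.com"
token = "your-api-token"

credentials = f"{email}:{token}".encode()
auth_header = f"Basic {base64.b64encode(credentials).decode()}"
print(auth_header)
```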

Fetching Pages Under a Parent

I didn't want every page in the space - that would be hundreds of unrelated pages. The Confluence REST API supports CQL (Confluence Query Language) with an ancestor filter that returns all descendants of a given page:

```
GET /rest/api/content/search?cql=ancestor=12345678 AND type=page&expand=body.storage
```

The parent page ID is the number in the Confluence URL: https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/12345678/Page+Title.
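If you'd rather not eyeball the URL, a throwaway regex can pull the ID out. parent_page_id below is a hypothetical helper for illustration, not part of the sync script:

```python
import re

def parent_page_id(url: str) -> str:
    # The page ID is the numeric segment right after /pages/ in the URL.
    match = re.search(r"/pages/(\d+)", url)
    if not match:
        raise ValueError(f"no page ID found in {url!r}")
    return match.group(1)

url = "https://your-domain.atlassian.net/wiki/spaces/SPACE/pages/12345678/Page+Title"
print(parent_page_id(url))  # → 12345678
```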

The Sync Script

The script does one thing: fetch Confluence pages and write them as local markdown files. No GCS upload, no Secret Manager calls, no framework dependencies beyond markdownify for HTML-to-markdown conversion.

```python
#!/usr/bin/env python3

import argparse
import base64
import json
import os
import shutil
import sys
import urllib.parse
import urllib.request
from pathlib import Path

from markdownify import markdownify

CONFLUENCE_URL = os.getenv("CONFLUENCE_URL", "")
CONFLUENCE_EMAIL = os.getenv("CONFLUENCE_EMAIL", "")
CONFLUENCE_API_TOKEN = os.getenv("CONFLUENCE_API_TOKEN", "")
CONFLUENCE_PARENT_PAGE_ID = os.getenv("CONFLUENCE_PARENT_PAGE_ID", "")

PAGE_LIMIT = 25
MIN_CONTENT_LENGTH = 50  # Skip container pages


def _auth_header() -> str:
    credentials = f"{CONFLUENCE_EMAIL}:{CONFLUENCE_API_TOKEN}"
    return f"Basic {base64.b64encode(credentials.encode()).decode()}"


def fetch_descendant_pages() -> list[dict]:
    base = CONFLUENCE_URL.rstrip("/")
    cql = f"ancestor={CONFLUENCE_PARENT_PAGE_ID} AND type=page"
    start = 0
    all_pages = []

    while True:
        url = (
            f"{base}/rest/api/content/search"
            f"?cql={urllib.parse.quote(cql)}"
            f"&expand=body.storage"
            f"&limit={PAGE_LIMIT}"
            f"&start={start}"
        )
        req = urllib.request.Request(url, headers={
            "Authorization": _auth_header(),
            "Accept": "application/json",
        })
        with urllib.request.urlopen(req, timeout=30) as resp:
            data = json.loads(resp.read().decode("utf-8"))

        results = data.get("results", [])
        all_pages.extend(results)

        if len(results) < PAGE_LIMIT:
            break
        start += PAGE_LIMIT

    return all_pages
```

Converting HTML to Markdown

Confluence stores page content as HTML (the "storage" format). markdownify handles the conversion:

```python
def page_to_markdown(page: dict) -> str:
    title = page.get("title", "Untitled")
    html_body = page.get("body", {}).get("storage", {}).get("value", "")
    md_body = markdownify(html_body, heading_style="ATX", strip=["img"])
    return f"# {title}\n\n{md_body}"
```

I strip images since they're not useful as LLM context - they'd just be broken image references in markdown.

Filtering Out Container Pages

Confluence spaces often have parent pages that exist purely for navigation - they contain nothing but a "Children Display" macro, which renders as a single word like "true" in the storage format. These add noise to the bot's context.

The fix: skip pages where the body content is shorter than 50 characters after conversion:

```python
for page in pages:
    title = page.get("title", "Untitled")
    md_content = page_to_markdown(page)

    # Skip container pages with no real content
    body_text = md_content.split("\n", 2)[-1].strip() if "\n" in md_content else ""
    if len(body_text) < MIN_CONTENT_LENGTH:
        skipped += 1
        print(f"  [skipped] {title} ({len(body_text)} chars)")
        continue

    filename = f"{sanitize_filename(title)}.md"
    (output_dir / filename).write_text(md_content, encoding="utf-8")
```
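sanitize_filename isn't shown here; a minimal sketch, assuming the goal is lowercase, filesystem-safe names, might be:

```python
import re

# Hypothetical implementation -- the real sanitize_filename may differ.
def sanitize_filename(title: str) -> str:
    # Replace anything that isn't a word character or dash with a dash,
    # collapse runs of dashes, trim them from the ends, and lowercase.
    cleaned = re.sub(r"[^\w-]+", "-", title)
    return re.sub(r"-{2,}", "-", cleaned).strip("-").lower()

print(sanitize_filename("FAQ: How to Deploy / Rollback?"))  # → faq-how-to-deploy-rollback
```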

CLI Arguments with Env Var Fallbacks

The script accepts values as CLI arguments for local use, falling back to environment variables for CI:

```python
parser.add_argument("--url", default=os.getenv("CONFLUENCE_URL", ""))
parser.add_argument("--email", default=os.getenv("CONFLUENCE_EMAIL", ""))
parser.add_argument("--token", default=os.getenv("CONFLUENCE_API_TOKEN", ""))
parser.add_argument("--parent-page-id", default=os.getenv("CONFLUENCE_PARENT_PAGE_ID", ""))
parser.add_argument("-o", "--output", type=Path, default=Path("confluence_pages"))
```

Locally:

```bash
uv run python scripts/sync_confluence.py \
  --url https://your-domain.atlassian.net/wiki \
  --email you@company.com \
  --token your-api-token \
  --parent-page-id 12345678
```

In CI: environment variables are set by the workflow, so no arguments needed.

The GitHub Actions Pipeline

Same pattern as the CMS data pipeline: fetch data, write to local files, upload to GCS with gsutil. The script doesn't know about GCS, and the workflow doesn't know about Confluence's API.

The Confluence credentials live in GCP Secret Manager. The workflow fetches them with gcloud secrets versions access and passes them as environment variables to the script:

```yaml
name: Sync Confluence Pages (Confluence → GCS)

on:
  schedule:
    - cron: '0 1 * * 1'  # Monday 10:00 JST
  workflow_dispatch:

env:
  GCS_MEMORY_BUCKET: your-bot-memory

jobs:
  sync:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write

    steps:
      - uses: actions/checkout@v6
      - uses: astral-sh/setup-uv@v6
      - uses: actions/setup-python@v6
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: uv sync

      - uses: google-github-actions/auth@v3
        with:
          workload_identity_provider: ${{ secrets.WIF_PROVIDER }}
          service_account: ${{ secrets.WIF_SERVICE_ACCOUNT }}
          token_format: access_token

      - uses: google-github-actions/setup-gcloud@v3

      - name: Fetch Confluence secrets
        id: secrets
        run: |
          echo "CONFLUENCE_URL=$(gcloud secrets versions access latest --secret=CONFLUENCE_URL)" >> "$GITHUB_OUTPUT"
          echo "CONFLUENCE_EMAIL=$(gcloud secrets versions access latest --secret=CONFLUENCE_EMAIL)" >> "$GITHUB_OUTPUT"
          echo "CONFLUENCE_API_TOKEN=$(gcloud secrets versions access latest --secret=CONFLUENCE_API_TOKEN)" >> "$GITHUB_OUTPUT"
          echo "CONFLUENCE_PARENT_PAGE_ID=$(gcloud secrets versions access latest --secret=CONFLUENCE_PARENT_PAGE_ID)" >> "$GITHUB_OUTPUT"

      - name: Fetch Confluence pages
        env:
          CONFLUENCE_URL: ${{ steps.secrets.outputs.CONFLUENCE_URL }}
          CONFLUENCE_EMAIL: ${{ steps.secrets.outputs.CONFLUENCE_EMAIL }}
          CONFLUENCE_API_TOKEN: ${{ steps.secrets.outputs.CONFLUENCE_API_TOKEN }}
          CONFLUENCE_PARENT_PAGE_ID: ${{ steps.secrets.outputs.CONFLUENCE_PARENT_PAGE_ID }}
        run: uv run python scripts/sync_confluence.py -o confluence_pages

      - name: Upload to GCS
        run: gsutil -m rsync -d confluence_pages gs://${{ env.GCS_MEMORY_BUCKET }}/aem
```

Why gsutil rsync -d?

The -d flag deletes files in GCS that don't exist locally. If a page was removed from Confluence, it gets removed from GCS too. -m runs transfers in parallel for speed.
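Conceptually, -d turns the upload into a mirror: compare the two file sets and delete whatever exists only on the remote side. A toy sketch of that semantics (the real gsutil also compares sizes and checksums before re-uploading):

```python
# Toy model of `gsutil rsync -d` file-set semantics (illustration only).
local = {"faq.md", "deploy.md", "rollback.md"}   # files produced by the sync script
remote = {"faq.md", "deploy.md", "old-page.md"}  # files currently under the bucket prefix

to_upload = local - remote   # present locally, missing remotely
to_delete = remote - local   # gone from Confluence; removed because of -d
print(sorted(to_upload), sorted(to_delete))  # → ['rollback.md'] ['old-page.md']
```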

Why uv sync in CI?

Unlike the CMS fetch script (which uses only stdlib), this script depends on markdownify for HTML-to-markdown conversion. So the workflow installs project dependencies with uv sync before running.

Secret Manager Setup

Four secrets, all straightforward:

```bash
echo -n "https://your-domain.atlassian.net/wiki" | \
  gcloud secrets create CONFLUENCE_URL --data-file=-

echo -n "you@company.com" | \
  gcloud secrets create CONFLUENCE_EMAIL --data-file=-

echo -n "your-api-token" | \
  gcloud secrets create CONFLUENCE_API_TOKEN --data-file=-

echo -n "12345678" | \
  gcloud secrets create CONFLUENCE_PARENT_PAGE_ID --data-file=-
```

The service account running the GHA workflow needs roles/secretmanager.secretAccessor to read these.

How It Fits Together

```
┌──────────────┐     ┌─────────────┐     ┌──────────────────┐
│  Confluence  │     │   GitHub    │     │  GCS             │
│  wiki pages  │────▶│   Actions   │────▶│  /aem/*.md       │
│   (HTML)     │     │  (weekly)   │     │                  │
└──────────────┘     │             │     │  /faq.md         │
                     │  fetch →    │     │  /processes.md   │
                     │  convert →  │     │  /data/archive.  │
                     │  upload     │     │    json          │
                     └─────────────┘     └────────┬─────────┘
                                                  │
                                        ┌─────────▼──────────┐
                                        │  Cloud Run         │
                                        │                    │
                                        │  MemoryService     │
                                        │  loads ALL .md     │
                                        │  files from bucket │
                                        └────────────────────┘
```

The bot doesn't know or care that some markdown files came from Confluence. It loads everything from GCS, same as before.

Gotchas

1. Space Key vs. Parent Page ID

My first version fetched all pages in a Confluence space. That pulled hundreds of unrelated pages. The CQL ancestor filter is what you want - it scopes to descendants of a specific page.

2. Container Pages Are Noise

Pages that only contain a "Children Display" macro produce almost no markdown content - just a title and "true". Without the minimum content length filter, these end up as near-empty files in the bot's context. A 50-character threshold catches them without filtering out genuinely short pages that carry real content.

3. Confluence API Pagination

The v1 REST API returns at most 25 results per request by default. If you have more than 25 pages under your parent, you need to paginate with the start parameter. Easy to miss if you're testing with a small space.

Wrapping Up

The pattern keeps repeating: external data source → fetch script → local files → GCS → bot reads from bucket. CMS data, knowledge markdown, and now Confluence pages all flow through the same pipeline shape. The bot's MemoryService doesn't need to know where the data came from - it just loads .md files.

Key takeaways:

  1. Don't ask people to maintain two places - If knowledge already lives in Confluence, sync it. Asking humans to copy-paste between systems is a losing battle.
  2. Scripts should be dumb pipes - The sync script fetches and converts. The CI workflow handles secrets and uploads. Neither knows about the other's concerns.
  3. CLI args with env var fallbacks - Same script works locally (pass arguments) and in CI (set env vars). No code branching needed.
  4. Filter noise early - Container pages, navigation pages, and stub pages add nothing to LLM context. A simple length check saves context window for real content.
  5. GCS as a universal knowledge sink - Markdown from humans, JSON from a CMS, markdown from Confluence. Different sources, same destination, same bot interface.

Project Navigation

  1. Building My First Flask App: A Next.js Developer's Perspective
  2. From TypeScript to Python: Setting Up a Modern Development Environment
  3. Deploying Python to GCP Cloud Run: A Guide for AWS Developers
  4. Integrating Vertex AI Gemini into Flask: Building an AI-Powered Slack Bot
  5. Adding GCS Memory to Gemini: Teaching Your Bot with Markdown Files
  6. Slack Bot Troubleshooting: Duplicate Messages, Cold Starts, and Gemini Latency
  7. Setting Up Analytics with BigQuery and Looker Studio
  8. Auto-Refreshing GCS Memory with Pub/Sub: Fixing the Stale Cache Problem
  9. Adding Thumbs Up/Down Feedback Buttons to Slack Bot Responses
  10. Adding Knowledge via Slack Workflow: Automating Documentation with Gemini
  11. Migrating from Gemini to Claude: Swapping LLMs on Vertex AI
  12. Adding Similar Case Recommendations: A Second Feature Without a Second Bot
  13. Syncing Confluence to GCS: Bridging Two Knowledge Sources