Hostex History Ingest
Loads your full Hostex guest-conversation history into an AI brain so your vacation-rental assistant can answer guest questions from real past exchanges.
- vacation rental
- AI assistant
- data ingestion
- knowledge base
- short-term rental
- automation
Setup & installation
What this seed does
This seed connects your Hostex short-term rental messaging history to an AI assistant brain. It reads through every past guest conversation — across all your properties — and distills the recurring facts (wifi passwords, parking instructions, check-in details, house rules, and more) into a structured, searchable knowledge base. Once the history is loaded, a daily background process keeps it current as new conversations arrive.
The practical result: when a guest asks "what's the wifi password?" your AI coordinator can look it up from real past conversations and reply immediately, without you having to step in. Distilled facts are stored as plain Markdown files, tracked in Git, and searchable by the AI assistant. PII (phone numbers, emails, guest names) is automatically scrubbed before anything is saved.
This seed is one link in a chain: it works on top of a Hermes-based AI agent scaffold and a brain storage layer, and delivers its full value alongside the seed-hermes-airbnb-manager seed that handles the live guest-conversation side. Installation runs a one-time bulk import of your full conversation history, then sets up a lightweight sidecar that checks for new conversations every 24 hours.
When to use it
- I want my AI rental assistant to answer 'what's the wifi password?' using what I've actually told past guests, not a static FAQ I have to maintain.
- I just set up an AI coordinator for my Airbnb properties and want to give it a head start by loading years of real guest Q&A before it goes live.
- I manage multiple vacation rental properties and want one AI assistant that knows the specific quirks and details of each property from actual conversations.
- A guest keeps asking about parking, and I want my AI to cite what past guests were told rather than pinging me every single time.
- I want my Hostex conversation history to keep the AI brain up to date automatically — new conversations should feed in on their own without me manually entering anything.
# Purpose
> See [[README#Purpose]].
## Normative Language
The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.
`Implementation-defined` means the behavior is part of the implementation contract; this specification does not prescribe a single policy.
Sub-folder SEEDs in this tree inherit the RFC 2119 declaration. They MUST NOT re-declare it.
## Dependencies
### Adjacent seeds (REQUIRED)
- `https://github.com/plow-pbc/seed-hermes` MUST be installed and a `<scaffold>/data/` bind-mount MUST be present. This seed targets the same scaffold. ^dep-seed-hermes
- `https://github.com/plow-pbc/seed-hermes-gbrain` MUST be installed in the same scaffold; this seed writes pages directly into gbrain via `gbrain put facts/<prop>/<topic>` (v0.2.0+; v0.1.x wrote to `/opt/data/home/brain/facts/` flat files and relied on `gbrain-sync` to index them). The substrate's subprocess HOME (`/opt/data/home/`) and embeddings-on default (`openai:text-embedding-3-large`) are inherited assumptions. ^dep-seed-hermes-gbrain
- `https://github.com/plow-pbc/seed-hermes-airbnb-manager` SHOULD be installed for the load-bearing end-to-end Verify (V9). The seed MAY be installed without airbnb-manager (fact pages are still written and indexed), but the "boss skill cites a fact" regression gate (V9) requires airbnb-manager's boss skill to be the consumer. Without it, V9 SHOULD be skipped via `verify.sh --skip-boss-citation`. ^dep-seed-hermes-airbnb-manager
### Runtime
- Hermes Agent MUST run in the Docker-backed `seed-hermes` shape: a host `compose.yaml`, a whole `./data:/opt/data` bind mount, and `HERMES_HOME=/opt/data` inside the container (inherited from `seed-hermes-gbrain` `^dep-hermes-docker`). ^dep-hermes-docker
- The Hermes runtime's subprocess-HOME injection at `/opt/data/home/` MUST be active. This seed's installer, sidecar, and all scripts run with `HOME=/opt/data/home` to match. Failure looks like "No brain configured" on `gbrain put`/`gbrain search` from this seed's processes. ^dep-subprocess-home
- The container's login-shell PATH MUST include `/usr/local/bin` (where gbrain and bun are symlinked by `seed-hermes-gbrain`). ^dep-path-login
- Hermes provider auth MUST be configured at the scaffold level (any of `openai-codex`, `openai-api`, `anthropic`). This seed creates a per-profile config that mirrors the scaffold's `model.provider`/`model.default`. ^dep-hermes-provider
- The container MUST have outbound network access to `https://api.hostex.io` (or whichever `HOSTEX_BASE_URL` is configured). ^dep-hostex-net
- The container MUST have `bash`, `curl`, `git`, `flock`, `python3`. The official `nousresearch/hermes-agent` image (Debian 13 trixie) supplies these five. `jq` is NOT in the base image and is installed by the installer at install time (apt-get; writes to the container's writable layer, so it needs re-running if the container is recreated). Same pattern `seed-hermes-gbrain` uses for `unzip`. ^dep-container-tools
### Host
- The host MUST have Docker with Compose support and a `seed-hermes` scaffold prepared with `./scripts/prepare.sh`. ^dep-host-docker
- The host MUST be able to bring up at least three Compose services on the same project: `hermes`, `gbrain-sync`, and `hostex-ingest-cron`. ^dep-host-multi-service
- The host setup path MUST NOT require host `hermes`, host `bun`, host `gbrain`, or host writes outside the scaffold directory. It also MUST NOT require container-side network access once install completes; the cron sidecar's outbound HTTPS call to Hostex at every tick is a runtime requirement, not an install one. ^dep-host-minimal
### Operator inputs
- `HOSTEX_ACCESS_TOKEN` — REQUIRED. Hostex API token (sent as `Hostex-Access-Token` header). Obtained from the operator's Hostex settings → API page. Mode `600` in `<scaffold>/data/.hostex-ingest.env`. ^dep-hostex-token
- `HOSTEX_BASE_URL` — OPTIONAL, default `https://api.hostex.io`. Override for staging / local fake-Hostex (e.g. `seedlab/seeds/dev-harness/dtu-hostex.seed.md`). ^dep-hostex-base
- `HOSTEX_INGEST_INTERVAL` — OPTIONAL, default `86400` (24h). Sidecar sleep-between-ticks in seconds. Minimum recommended 3600 (Hostex rate-limit caution). ^dep-ingest-interval
- `HOSTEX_INGEST_DISTILLER_MODEL` — OPTIONAL, default = scaffold's `model.default`. Override to a cheaper model for distillation (`gpt-5-mini`, `claude-haiku-4-5-20251001`, etc.). Written into `data/profiles/hostex-distiller/config.yaml` `model.default`. ^dep-distiller-model
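The operator inputs above can be read as a small config loader. The following is a minimal sketch in Python, not the seed's actual implementation: `IngestConfig`, `load_config`, and `interval_warning` are hypothetical names, and only the variable names and defaults come from the list above.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

MIN_RECOMMENDED_INTERVAL = 3600  # seconds; the rate-limit caution noted above


@dataclass(frozen=True)
class IngestConfig:
    access_token: str
    base_url: str = "https://api.hostex.io"
    interval_seconds: int = 86400
    distiller_model: Optional[str] = None  # None: fall back to the scaffold's model.default


def load_config(env: Mapping[str, str]) -> IngestConfig:
    """Build a config from an environment mapping, applying the documented defaults."""
    token = env.get("HOSTEX_ACCESS_TOKEN", "").strip()
    if not token:
        raise ValueError("HOSTEX_ACCESS_TOKEN is required")
    return IngestConfig(
        access_token=token,
        base_url=env.get("HOSTEX_BASE_URL", "https://api.hostex.io").rstrip("/"),
        interval_seconds=int(env.get("HOSTEX_INGEST_INTERVAL", "86400")),
        distiller_model=env.get("HOSTEX_INGEST_DISTILLER_MODEL") or None,
    )


def interval_warning(cfg: IngestConfig) -> Optional[str]:
    """Return a warning string when the interval is below the recommended minimum."""
    if cfg.interval_seconds < MIN_RECOMMENDED_INTERVAL:
        return (
            f"interval {cfg.interval_seconds}s is below the recommended "
            f"minimum of {MIN_RECOMMENDED_INTERVAL}s"
        )
    return None
```

A caller would typically pass `os.environ`. The sketch treats a short interval as a warning rather than an error because the text only recommends the minimum.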
## Objects
The named entities that exist on the Hermes / brain side. Plow Chat, gbrain page mechanics, and the airbnb-manager boss/listener skills are defined in their respective seeds; this seed does not redefine them.
### Hermes profile (DISTILLER)
- A Hermes profile (default name `hostex-distiller`) MUST exist on the scaffold and MUST have its `config.yaml` `model:` block mirrored from the scaffold-level config, with `model.default` overridable via `HOSTEX_INGEST_DISTILLER_MODEL` (per `seed-hermes-gbrain` `^act-profile-model-mirror`). ^obj-distiller-profile
- The profile MUST NOT have any client-facing platforms enabled (no Hostex webhook, no plow_chat, no telegram). It exists solely to be invoked via `hermes -p hostex-distiller chat -q "<prompt>"` by the ingest scripts. ^obj-distiller-no-platforms
- The profile's `data/SOUL.md` MUST contain the distiller persona from `ref/hermes-soul/distiller-SOUL.md`. ^obj-distiller-soul
- The profile's `data/skills/distill-conversation/SKILL.md` MUST contain the contents of `ref/hermes-skills/distill-conversation/SKILL.md`. ^obj-distiller-skill
### Brain pages — `facts/` (NEW page class)
All pages live under `/opt/data/home/brain/facts/`. The `gbrain-sync` sidecar indexes them on its 5-minute tick. Writes MUST be flock-protected and git-committed.
- `/opt/data/home/brain/facts/<property-slug>/<topic-slug>.md` is a single distilled fact about a specific property and topic. Filename pattern: `<topic_slug>.md` (no timestamps; the page IS the merge target for that `(property, topic)` pair). ^obj-fact-property-page
- `/opt/data/home/brain/facts/general/<topic-slug>.md` is a single distilled fact that is genuinely cross-property (e.g. operator-wide cancellation policy). Distillers SHOULD emit these rarely; default emission is per-property. ^obj-fact-general-page
- Frontmatter is normative. Pages MUST contain ALL of: ^obj-fact-frontmatter
- `title` (string)
- `topic_slug` (string, lowercase snake_case, MUST match filename basename without `.md`)
- `property_id` (integer OR null for `facts/general/`; Hostex returns numeric property IDs and the writer preserves the integer type)
- `property_slug` (string; `"general"` for `facts/general/`)
- `channel_types` (array of strings, non-empty, e.g. `["airbnb", "vrbo"]`)
- `source_conversation_ids` (array of strings, non-empty, append-only)
- `source_message_ids` (array of strings, non-empty, append-only)
- `first_seen_at` (ISO 8601 UTC, immutable after creation)
- `last_seen_at` (ISO 8601 UTC, set on every merge write)
- `confidence` (string, one of `high` | `medium` | `low`; high = ≥3 concordant sources, medium = 1-2, low = distiller flagged uncertainty)
- `ingest_version` (integer, `1` for v0.1.0)
- The page body MUST contain a `## Fact` section (the distilled 1-3 sentence statement) and a `## Sources` section (per-source-message verbatim quotes, each prefixed with `<conv_id>/<msg_id> @ <iso-ts>:`). Readers MAY display the body to humans; the authoritative state is the frontmatter. ^obj-fact-body
- The brain repo MUST contain `/opt/data/home/brain/facts/.gitkeep` so the `facts/` directory exists before any page is written. ^obj-facts-gitkeep
- Channel-conflict rule: if the distiller observes conflicting facts for the same `(property, topic)` across channels, it MUST emit channel-qualified `topic_slug`s (e.g. `cancellation_airbnb` and `cancellation_vrbo`, keeping the snake_case rule above) so the conflict is resolved at the page level, not within a single page. ^obj-fact-channel-conflict
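Because the frontmatter is normative, a reader or test harness may want to check a parsed page against the rules above. The sketch below assumes the frontmatter has already been parsed into a Python dict; `frontmatter_errors` is a hypothetical helper, and reading "integer OR null for `facts/general/`" as "null only under `general`" is an interpretation, not something the spec states outright.

```python
import re
from typing import Any, Dict, List

REQUIRED_KEYS = [
    "title", "topic_slug", "property_id", "property_slug", "channel_types",
    "source_conversation_ids", "source_message_ids", "first_seen_at",
    "last_seen_at", "confidence", "ingest_version",
]
SNAKE_CASE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")


def frontmatter_errors(fm: Dict[str, Any], filename: str) -> List[str]:
    """Return human-readable violations; an empty list means the page passes."""
    errors = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in fm]
    if errors:
        return errors
    slug = fm["topic_slug"]
    if not isinstance(slug, str) or not SNAKE_CASE.match(slug):
        errors.append("topic_slug must be lowercase snake_case")
    if f"{slug}.md" != filename:
        errors.append("topic_slug must match the filename basename")
    pid = fm["property_id"]
    if fm["property_slug"] == "general":
        if pid is not None:
            errors.append("property_id must be null under facts/general/")
    elif not isinstance(pid, int) or isinstance(pid, bool):
        errors.append("property_id must be an integer")
    for key in ("channel_types", "source_conversation_ids", "source_message_ids"):
        value = fm[key]
        if not (isinstance(value, list) and value and all(isinstance(v, str) for v in value)):
            errors.append(f"{key} must be a non-empty array of strings")
    if fm["confidence"] not in ("high", "medium", "low"):
        errors.append("confidence must be high, medium, or low")
    return errors
```

Timestamp format and `ingest_version` checks are left out to keep the sketch short.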
### Ingest state — `/opt/data/home/.hostex-ingest/state.json`
A single file under flock control. Schema in `ref/container/hostex-ingest/state-schema.json`.
- `state_version` (integer, `1`)
- `last_walk_at` (ISO 8601 UTC; set to the walk-start timestamp after every successful walk; epoch `1970-01-01T00:00:00Z` on first run)
- `properties_known` (array of `{id, slug, title, first_seen_at}` records)
- `processed_conversation_ids` (map from `conversation_id` → `{last_distilled_at, last_message_at, fact_count}`). The watermark stored per-conversation is the `last_message_at` ISO timestamp returned by Hostex's listing endpoint; on subsequent walks, if this value is unchanged the conversation is skipped (idempotent A6).
- `api_failures` (array of `{conversation_id, last_attempt_at, attempts, reason}` — A2)
- `distill_failures` (array of `{conversation_id, last_attempt_at, attempts, reason}` — T3)
- `pii_blocks` (array of `{conversation_id, topic_slug, reason, blocked_at}` — A7)
- `sanity_drops` (array of `{conversation_id, topic_slug, reason, dropped_at}` — Q1)
State writes MUST be wrapped in `flock /opt/data/home/.hostex-ingest/state.json.lock`. Held → caller exits `EXIT_LOCKED` (exit code 75, `EX_TEMPFAIL`). ^obj-state-flock
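As an illustration of the state file and its lock, here is a hedged Python sketch. The real scripts are shell and use the `flock` utility; this version uses `fcntl.flock` from the standard library to show the same non-blocking behavior. `initial_state`, `state_lock`, and `StateLocked` are hypothetical names, and starting `processed_conversation_ids` as an empty object follows from it being a map in the schema above.

```python
import fcntl
import os
from contextlib import contextmanager

EXIT_LOCKED = 75  # EX_TEMPFAIL


def initial_state() -> dict:
    """State document as it would look before the first walk."""
    return {
        "state_version": 1,
        "last_walk_at": "1970-01-01T00:00:00Z",
        "properties_known": [],
        "processed_conversation_ids": {},  # a map, so it starts as an empty object
        "api_failures": [],
        "distill_failures": [],
        "pii_blocks": [],
        "sanity_drops": [],
    }


class StateLocked(Exception):
    """Raised when another process already holds the state lock."""


@contextmanager
def state_lock(lock_path: str):
    """Hold an exclusive, non-blocking lock on lock_path for the duration of the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise StateLocked(lock_path) from None
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A caller would catch `StateLocked` at the top level and call `sys.exit(EXIT_LOCKED)`, which matches the "held, so exit 75" rule rather than waiting for the other holder.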
### Cron sidecar
- `<scaffold>/compose.hostex-ingest.yaml` MUST declare a `hostex-ingest-cron` Compose service using the `nousresearch/hermes-agent` image, running as `${HERMES_UID}:${HERMES_GID}` with `HOME=/opt/data/home`, mounting `./data:/opt/data`, loading `<scaffold>/data/.hostex-ingest.env` via `env_file:`, `depends_on: hermes`, `restart: unless-stopped`. ^obj-cron-service
- The sidecar's command MUST be `bash -lc '/opt/data/home/hostex-ingest/cron-loop.sh'`. The file at that path MUST be the contents of `ref/container/hostex-ingest/cron-loop.sh`. ^obj-cron-script
- `<scaffold>/.env` MUST be updated so `COMPOSE_FILE` includes `compose.hostex-ingest.yaml` in its `:`-separated chain. Installer MUST preserve all preexisting entries. ^obj-cron-compose-file
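The `COMPOSE_FILE` rule above (add the seed's file, keep every existing entry) can be sketched as a pure function. This is an illustrative Python version, not the installer's code: `add_compose_file` is a hypothetical name, and falling back to `compose.yaml` when the chain is empty is an assumption made so the base file is not silently dropped.

```python
from typing import List


def add_compose_file(
    chain: str,
    entry: str = "compose.hostex-ingest.yaml",
    default_base: str = "compose.yaml",  # assumption: the scaffold's base file name
) -> str:
    """Return the ':'-separated chain with `entry` present exactly once, order preserved."""
    parts: List[str] = [part for part in chain.split(":") if part]
    if not parts:
        parts = [default_base]
    if entry not in parts:
        parts.append(entry)
    return ":".join(parts)
```

Because the function is a no-op when the entry is already present, calling it on every install run keeps the installer idempotent.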
### Host orchestration scripts
- `ref/scripts/install_hostex_ingest_into_compose.sh` is the canonical installer. It MUST be idempotent. It MUST run every install step that touches the brain repo, state file, or distiller profile as the hermes user (default uid 501, gid 20; overridable via `--uid` / `--gid`) with `HOME=/opt/data/home` pinned. ^obj-install-script
- `ref/scripts/initial_ingest.sh` is the bulk-ingest one-shot, runnable as `docker compose exec -T -u 501:20 -e HOME=/opt/data/home hermes bash -lc '/opt/data/home/hostex-ingest/initial-ingest.sh [--limit N]'`. It MUST be resumable after kill (A6) — re-running picks up from `state.json.processed_conversation_ids`. ^obj-initial-ingest
- `ref/scripts/incremental_refresh.sh` is the sidecar payload. Invoked by `cron-loop.sh`; safe to run manually for testing via `--force` (bypasses the sleep-until-next-tick wait but NOT the state flock). ^obj-incremental-refresh
- `ref/scripts/uninstall.sh` MUST stop and remove the `hostex-ingest-cron` sidecar, delete `<scaffold>/compose.hostex-ingest.yaml`, delete `<scaffold>/data/.hostex-ingest.env`, and strip the seed-managed entry from `<scaffold>/.env`'s `COMPOSE_FILE`. With `--purge`, it MAY also delete the `hostex-distiller` profile (DESTRUCTIVE: clears its sessions) and remove `/opt/data/home/hostex-ingest/`. With `--purge-facts`, it MAY also delete `/opt/data/home/brain/facts/` (DESTRUCTIVE: loses all distilled history). It MUST NOT delete `state.json`, even with `--purge-facts`, because that file is the audit trail. ^obj-uninstall-script
- `ref/verify.sh` MUST be runnable against a fresh install and a re-install. It includes V9 (the CRITICAL boss-skill citation regression) which requires `seed-hermes-airbnb-manager` to be installed; absent that, `verify.sh --skip-boss-citation` MUST exit zero on V1-V8. ^obj-verify-script
### Shared lib
- `ref/container/hostex-ingest/ingest-lib.sh` exports the functions both `initial_ingest.sh` and `incremental_refresh.sh` source: `walk_listing`, `fetch_conversation`, `distill`, `validate_distiller_output`, `merge_into_page`, `pii_sweep_fact`, `state_read`, `state_write`, `acquire_state_lock`, `release_state_lock`. ^obj-ingest-lib
## Actions
### Install
- A host agent MUST run `ref/scripts/install_hostex_ingest_into_compose.sh --scaffold <dir>` against a scaffold whose `docker compose up` is already running with `seed-hermes-gbrain` healthy. ^act-install-prereq
- The installer MUST refuse to proceed if `docker compose exec -T -u <uid>:<gid> hermes bash -lc 'which gbrain'` does not resolve. ^act-install-gbrain-guard
- The installer MUST refuse to proceed if `<scaffold>/data/home/brain/.git` does not exist (proves `seed-hermes-gbrain` initialized the brain repo). ^act-install-brain-guard
- The installer MUST read `HOSTEX_ACCESS_TOKEN` from one of (in priority order) `--hostex-token <value>`, `$HOSTEX_INGEST_TOKEN`, or an interactive prompt (silent, no echo) when stdin is a TTY. The installer MUST fail fast if no token can be obtained. ^act-install-token
- The installer MUST pre-flight the token with `curl -sS -H "Hostex-Access-Token: $HOSTEX_ACCESS_TOKEN" -H "User-Agent: curl/8.7.1" "$HOSTEX_BASE_URL/v3/properties?offset=0&limit=1"` and refuse to proceed on non-2xx. Operator MAY skip via `--skip-token-validation`. ^act-install-token-preflight
- The installer MUST write the token to `<scaffold>/data/.hostex-ingest.env` (mode 600, sidecar-only via `env_file:`). The `hermes` service in `compose.yaml` MUST NOT reference this file. ^act-install-token-storage
- The installer MUST create the `hostex-distiller` profile if missing and call `ref/scripts/lib/mirror_model_block.sh <scaffold>/data/config.yaml <scaffold>/data/profiles/hostex-distiller/config.yaml` to mirror the scaffold's `model:` block, then overwrite `model.default` with `HOSTEX_INGEST_DISTILLER_MODEL` if set. ^act-install-distiller-profile
- The installer MUST write `data/profiles/hostex-distiller/SOUL.md` from `ref/hermes-soul/distiller-SOUL.md` and `data/profiles/hostex-distiller/skills/distill-conversation/SKILL.md` from `ref/hermes-skills/distill-conversation/SKILL.md`. ^act-install-distiller-soul-skill
- The installer MUST create `/opt/data/home/hostex-ingest/` and copy `cron-loop.sh`, `initial-ingest.sh`, `incremental-refresh.sh`, `ingest-lib.sh` into it. ^act-install-container-scripts
- The installer MUST create `/opt/data/home/.hostex-ingest/` and write `state.json` with the initial schema (`state_version: 1`, `last_walk_at: 1970-01-01T00:00:00Z`, empty arrays). ^act-install-state-init
- The installer MUST create `/opt/data/home/brain/facts/.gitkeep` if missing, then `git add` + `git commit -m "ingest: init facts/ tree"` from `/opt/data/home/brain/`. ^act-install-facts-init
- The installer MUST write `<scaffold>/compose.hostex-ingest.yaml` and update `<scaffold>/.env` `COMPOSE_FILE` chain to include it, preserving prior entries. ^act-install-compose
- The installer MUST clear the distiller profile's `.skills_prompt_snapshot.json` so Hermes reloads the skill on first run. ^act-install-snapshot-clear
- The installer SHOULD print a "next steps" footer telling the operator to (1) run `docker compose up -d` to start the sidecar and (2) run `docker compose exec -T -u 501:20 -e HOME=/opt/data/home hermes bash -lc '/opt/data/home/hostex-ingest/initial-ingest.sh'` to do the bulk ingest. ^act-install-next-steps
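The token lookup order in `^act-install-token` (flag, then environment, then a silent prompt on a TTY, else fail fast) might look like the following in Python. This is a sketch under stated assumptions: `resolve_token` is a hypothetical helper, and the prompt text and error message are invented for illustration.

```python
import getpass
from typing import Callable, Mapping, Optional


def resolve_token(
    flag_value: Optional[str],
    env: Mapping[str, str],
    stdin_is_tty: bool,
    prompt: Callable[[str], str] = getpass.getpass,  # no echo, as the spec requires
) -> str:
    """Resolve the Hostex token in priority order, or exit if none is available."""
    if flag_value:
        return flag_value
    from_env = env.get("HOSTEX_INGEST_TOKEN", "")
    if from_env:
        return from_env
    if stdin_is_tty:
        typed = prompt("Hostex access token: ").strip()
        if typed:
            return typed
    raise SystemExit("no Hostex token available; pass --hostex-token or set HOSTEX_INGEST_TOKEN")
```

In an installer this would be called as `resolve_token(args.hostex_token, os.environ, sys.stdin.isatty())`, where `args` is whatever the argument parser produced. Injecting `prompt` keeps the function testable without a terminal.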
### Bulk ingest (`initial_ingest.sh`)
- The script MUST acquire `flock /opt/data/home/.hostex-ingest/state.json.lock`. If held by the cron sidecar, exit `75` (`EX_TEMPFAIL`). ^act-initial-flock
- The script MUST walk `GET /v3/conversations?offset=&limit=100`, paginating until an empty `conversations` array. ^act-initial-walk
- The script MUST fetch `GET /v3/properties?offset=0&limit=100` once at start and build the `property_id` ↔ `property_slug` map; add unseen properties to `state.json.properties_known[]`. ^act-initial-properties
- For each conversation in listing order: ^act-initial-per-conv
- If `conversation_id` is in `state.json.processed_conversation_ids` AND its stored `last_message_at` matches the current listing entry's `last_message_at`, SKIP (idempotent A6).
- Else: `GET /v3/conversations/{conversation_id}` with retry+backoff (A2). On persistent 429: append to `state.json.api_failures[]` and continue (do NOT abort the whole run).
- Invoke distiller: `hermes -p hostex-distiller chat -q "$(distill_prompt < transcript.json)"`. On malformed JSON: retry once with a "your previous reply was not valid JSON — return only the array" follow-up. On second failure: append to `state.json.distill_failures[]` and continue.
- Validate distiller output: `validate_distiller_output` (Q1) — drop any fact whose `source_message_ids[]` contains zero IDs present in the input transcript. Dropped → `state.json.sanity_drops[]`.
- PII sweep (A7): for each surviving fact, regex-check the body for `\+\d{10,}`, email pattern, OR any guest_name from the input. On hit: drop and append to `state.json.pii_blocks[]`.
- For each surviving fact: `merge_into_page(fact)`:
- Target page = `facts/<property-slug>/<topic-slug>.md` (or `facts/general/<topic-slug>.md` if `property_id` is null).
- If page exists: read frontmatter, append `source_conversation_ids[]` + `source_message_ids[]` (dedup), set `last_seen_at`, increment `confidence` (medium → high at ≥3 sources). Re-distill body if total sources crossed the threshold of 5 (Implementation-defined whether to re-distill on every merge or batch; v0.1.0 SHOULD batch — re-distill once per `initial_ingest.sh` run per page that crossed threshold).
- If page doesn't exist: write fresh.
- Acquire `flock /opt/data/home/brain/.ingest-write.lock`, write page, `git add` page, release lock.
- Re-acquire state lock, record `conversation_id` in `state.json.processed_conversation_ids` (with its `last_distilled_at`, `last_message_at`, and `fact_count`), release.
- The script MUST batch git commits in groups of 25 conversations: `git commit -m "ingest: bulk batch <N>/<total>"`. Final commit: `git commit -m "ingest: bulk done — <N> conversations, <M> facts"`. ^act-initial-commit-batch
- The script MUST set `state.json.last_walk_at` to the walk-start timestamp on success. ^act-initial-watermark
- The script MUST accept `--limit N` for partial ingest (testing); default unlimited. ^act-initial-limit-flag
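Two of the per-conversation gates above, the Q1 sanity check and the A7 PII sweep, are simple enough to sketch. The Python below is illustrative only: `sanity_filter` and `pii_reason` are hypothetical names, the email regex is a deliberately simple stand-in because the spec only says "email pattern", and guest names are matched as case-insensitive substrings, which can over-block short names.

```python
import re
from typing import Iterable, List, Set, Tuple

PHONE_RE = re.compile(r"\+\d{10,}")
# Simple stand-in; the spec does not pin down the exact email pattern.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanity_filter(facts: List[dict], transcript_ids: Set[str]) -> Tuple[List[dict], List[dict]]:
    """Q1: keep facts that cite at least one message ID present in the transcript."""
    kept: List[dict] = []
    dropped: List[dict] = []
    for fact in facts:
        cited = set(fact.get("source_message_ids", []))
        if cited & transcript_ids:
            kept.append(fact)
        else:
            dropped.append(fact)
    return kept, dropped


def pii_reason(text: str, guest_names: Iterable[str]) -> str:
    """A7: return why the text is blocked, or an empty string if it looks clean."""
    if PHONE_RE.search(text):
        return "phone"
    if EMAIL_RE.search(text):
        return "email"
    lowered = text.lower()
    for name in guest_names:
        cleaned = name.strip().lower()
        if cleaned and cleaned in lowered:
            return "guest_name"
    return ""
```

Dropped facts would then be appended to `sanity_drops[]` and blocked ones to `pii_blocks[]`, with the returned reason filling the `reason` field.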
### Incremental refresh (`incremental_refresh.sh`)
- The script MUST acquire the state lock; held → exit 75 cleanly (A3). ^act-refresh-flock
- The script MUST read `state.json.last_walk_at`, snapshot `walk_start = now()` (UTC), then re-fetch `/v3/properties` (A4) — new entries → `properties_known[]`. ^act-refresh-properties
- The script MUST walk `/v3/conversations` until either (a) `conversation.last_message_at <= state.last_walk_at` OR (b) the number of pages fetched reaches `MAX_INCREMENTAL_PAGES` (default 50; at `limit=100` that is 5000 conversations), whichever comes first. ^act-refresh-walk
- For each candidate conversation: fetch detail, filter messages where `created_at > state.processed_conversation_ids[id].last_distilled_at` (or all messages if first time seen). Distill only the new messages' contribution. Merge per `^act-initial-per-conv` rules. ^act-refresh-per-conv
- On success, set `state.json.last_walk_at = walk_start`, commit `git commit -m "ingest: refresh — <N> convs updated, <M> facts touched"`. ^act-refresh-watermark
- The script MUST accept `--force` (bypass the cron-loop sleep — for testing) and `--dry-run` (walk + distill but do not write pages or commits). ^act-refresh-flags
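The stopping rule for the incremental walk can be illustrated with a small generator. This is a sketch, not the shipped script: `fetch_page` is a hypothetical stand-in for the `GET /v3/conversations` call, a page size of 100 is assumed, and the string comparison assumes every timestamp uses the same `YYYY-MM-DDTHH:MM:SSZ` format.

```python
from typing import Callable, Dict, Iterator, List

PAGE_SIZE = 100  # assumed to match the bulk walk's limit


def walk_candidates(
    fetch_page: Callable[[int, int], List[Dict[str, str]]],
    last_walk_at: str,
    max_pages: int = 50,
) -> Iterator[Dict[str, str]]:
    """Yield conversations newer than the watermark, newest first."""
    for page in range(max_pages):
        conversations = fetch_page(page * PAGE_SIZE, PAGE_SIZE)
        if not conversations:
            return
        for conv in conversations:
            # Same-format ISO 8601 UTC strings sort chronologically as plain text.
            if conv["last_message_at"] <= last_walk_at:
                return
            yield conv
```

Because the listing is sorted newest first, the first conversation at or below the watermark ends the walk; the page cap only matters when more than `max_pages * PAGE_SIZE` conversations changed since the last run.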
### Cron loop (`cron-loop.sh`)
- The loop MUST sleep `HOSTEX_INGEST_INTERVAL` seconds between ticks (default 86400). On startup, sleep for `HOSTEX_INGEST_INTERVAL_INITIAL_SLEEP` (default 60) before first tick — so the container doesn't hammer Hostex during operator install. ^act-cron-sleep
- Each tick MUST invoke `incremental_refresh.sh` and log stdout+stderr to `/opt/data/home/hostex-ingest/logs/cron.log` (rotated by size; keep last 1MB). ^act-cron-log
- The loop MUST trap `SIGTERM` and `SIGINT` for clean shutdown (sleep interruptible). ^act-cron-signals
- The loop MUST NOT call any LLM directly — all LLM work happens inside `incremental_refresh.sh` via the `hostex-distiller` profile. ^act-cron-no-llm
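The loop's shape (initial sleep, tick, interruptible sleep, clean shutdown on a signal) is sketched below in Python for illustration; the actual `cron-loop.sh` is a shell script. `run_loop` is a hypothetical function, `max_ticks` and `install_signal_handlers` are test hooks that are not part of the spec, and log rotation is omitted.

```python
import signal
import threading
from typing import Callable, Optional


def run_loop(
    tick: Callable[[], None],
    interval: float = 86400,
    initial_sleep: float = 60,
    max_ticks: Optional[int] = None,  # test hook, not part of the spec
    install_signal_handlers: bool = True,
) -> int:
    """Run `tick` every `interval` seconds until SIGTERM or SIGINT; return the tick count."""
    stop = threading.Event()
    if install_signal_handlers:
        for sig in (signal.SIGTERM, signal.SIGINT):
            signal.signal(sig, lambda signum, frame: stop.set())
    ticks = 0
    # Event.wait returns True as soon as the event is set, so sleeps are interruptible.
    if stop.wait(initial_sleep):
        return ticks
    while max_ticks is None or ticks < max_ticks:
        tick()
        ticks += 1
        if stop.wait(interval):
            break
    return ticks
```

Here `tick` would shell out to `incremental_refresh.sh`, so the loop itself never talks to an LLM.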
### Distiller skill
- The skill at `data/profiles/hostex-distiller/skills/distill-conversation/SKILL.md` MUST advertise frontmatter: `name: distill-conversation`, `version: 1.0.0`, a description that includes the markers `transcript-in`, `facts-JSON-out`, `redact PII`. ^act-distiller-skill-frontmatter
- The skill's body MUST instruct the model to return EXACTLY a JSON array of `{topic_slug, fact, confidence, property_id, channel_types, source_message_ids}` records and nothing else. It MUST also tell the model that topic slugs are lowercase snake_case with a free vocabulary, that facts describe the PROPERTY and not the GUEST, and that PII (guest names, phone numbers, emails, payment details) is redacted. Channel-conflict rule: if facts conflict across channels, emit channel-qualified `topic_slug`s. ^act-distiller-skill-contract
- The skill MUST include 1 fixture-aligned example in its body so the model anchors on the expected shape. ^act-distiller-skill-example
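On the consuming side, the "JSON array and nothing else" contract suggests a strict parser whose failure triggers the retry described under bulk ingest. The following Python sketch shows one way to do that; `parse_distiller_reply` is a hypothetical name, and requiring exactly the six listed keys (no extras) is a stricter reading than the spec necessarily demands.

```python
import json
from typing import Any, List

RECORD_KEYS = {
    "topic_slug", "fact", "confidence",
    "property_id", "channel_types", "source_message_ids",
}


def parse_distiller_reply(reply: str) -> List[dict]:
    """Parse a reply that must be exactly one JSON array of fact records."""
    try:
        data: Any = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("reply must be a JSON array")
    for index, record in enumerate(data):
        if not isinstance(record, dict) or set(record) != RECORD_KEYS:
            raise ValueError(f"record {index} does not have exactly the expected keys")
    return data
```

A `ValueError` here is the signal to send the follow-up prompt once, then record a `distill_failures[]` entry if the second reply also fails.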
## Verify
Run `ref/verify.sh --scaffold <scaffold>` from the host. Exit 0 = Done.
1. **V1. Prerequisites:** `docker compose exec -T -u 501:20 hermes bash -lc 'which gbrain'` resolves; `<scaffold>/data/home/brain/.git` exists. ^v-prereqs
2. **V2. Distiller profile:** `<scaffold>/data/profiles/hostex-distiller/SOUL.md`, `skills/distill-conversation/SKILL.md`, and `config.yaml` (with mirrored `model:` block) all exist. ^v-distiller-profile
3. **V3. Distiller golden:** `hermes -p hostex-distiller chat -q "$(cat tests/fixtures/conversation-detail.fixture.json)"` returns a JSON array parseable by `jq -e 'length > 0'`; emitted `topic_slug`s ⊇ the golden set (`wifi`, `parking` per the fixture). ^v-distiller-golden
4. **V4. Bulk ingest E2E (limit=3):** `initial_ingest.sh --limit 3` writes ≥1 page under `/opt/data/home/brain/facts/`, `git log --oneline -5` shows `ingest:` commits, `gbrain sync --repo /opt/data/home/brain` succeeds, `gbrain search wifi --limit 5` returns ≥1 hit. ^v-bulk-e2e
4a. **V4a. Idempotent re-run:** running `initial_ingest.sh --limit 3` again produces zero new commits and `state.json.processed_conversation_ids` is unchanged. ^v-idempotent
5. **V5. Sidecar up:** `docker compose ps --services --status running` includes `hostex-ingest-cron`. ^v-sidecar
6. **V6. Incremental refresh:** inject a synthetic conversation (or use the `tests/fixtures/conversations-list.fixture.json` listing) via a mocked endpoint; run `incremental_refresh.sh --force`; assert the relevant `facts/<prop>/<topic>.md` gets a new `source_message_id` appended and a new git commit `ingest: refresh ...`. ^v-incremental
7. **V7. PII redaction:** assert no page under `/opt/data/home/brain/facts/` matches the regex `\+\d{10,}` (phone) or any of the captured guest emails from the fixture. ^v-pii
8. **V8. State lock behavior:** start `incremental_refresh.sh` in background holding the state lock; run `initial_ingest.sh --limit 1` foreground; assert exit code 75 (`EXIT_LOCKED`) and clean error message. ^v-state-lock
9. **V9. CRITICAL — boss skill cites a fact:** (requires `seed-hermes-airbnb-manager` installed; skippable via `--skip-boss-citation`) bake a distinctive marker (e.g. `IngestMarker_BX42Q`) into a fact via fixture-driven `initial_ingest.sh --limit 1` against a controlled transcript; post the captured Hostex `message_created` callback (asking the question the marker answers) to the tunnel; assert the boss skill's mirrored draft contains the marker verbatim. ^v-boss-cite
## Open
- `seed-hermes-gbrain` ships with embeddings ON by default (`openai:text-embedding-3-large`). This seed depends on that for `gbrain search` to handle natural-language queries from the boss skill. If the operator opted into `--no-embedding`, this seed STILL works (gbrain falls back to keyword search) but query quality drops for synonym phrasings like "internet" vs "wifi". Document in README. ^o-no-embedding-mode
- v0.1.0 ships `ingest_version: 1` pages. When the distiller prompt evolves materially, v0.2.0 will ship `ref/scripts/redistill_all.sh` to walk every page and re-process. v0.1.0 explicitly does NOT ship this. ^o-redistill
- The `MAX_INCREMENTAL_PAGES` cap (default 50) means a long-quiet ingest can miss conversations updated >5000 entries ago. Practical operators don't hit this; document the cap. ^o-incremental-cap
- The `--limit N` flag on `initial_ingest.sh` walks conversations in listing order (most-recent-first). Operators wanting a stratified sample (e.g., random N across properties) need to write their own selector; out of scope. ^o-stratified-sample
- `mirror_model_block.sh` is now a python implementation (v0.1.1; was awk in v0.1.0 — `awk -v block="$MULTI_LINE"` couldn't represent embedded newlines). It assumes a standard YAML shape (top-level `model:` followed by indented `provider:` / `default:` keys, no anchors, no flow-style maps). For scaffolds that hand-edit the model block into a non-standard shape, verify the produced `<scaffold>/data/profiles/<distiller>/config.yaml` after install. ^o-mirror-model-block-simplicity
- **`jq` is NOT in the `nousresearch/hermes-agent` base image** (v0.1.0 claimed it was; corrected in v0.1.1). The installer apt-gets jq into the container at install time, same pattern `seed-hermes-gbrain` uses for `unzip`. Caveat: `docker compose up` recreating the container drops the apt install; re-running the installer fixes it. A future v0.1.2 may switch to a derived image with jq baked in. ^o-jq-install
- **`hermes` is at `/opt/hermes/.venv/bin/hermes` inside the hermes-main container, NOT symlinked to `/usr/local/bin/hermes`.** Sibling sidecars like airbnb-courier have it on PATH because their own installers symlink it into THEIR writable layer; the hermes-main service doesn't. v0.1.1 detects `HERMES_BIN` once in `ingest-lib.sh` and uses the absolute path so `distill()` survives container recreates. ^o-hermes-binpath
- **`gbrain search` / `gbrain query` from inside the hermes container crashes on macOS 26.3 hosts** (upstream gbrain PGLite WASM bug, garrytan/gbrain#223). The `gbrain-sync` sidecar still embeds pages successfully (different code path). The downstream consumer (airbnb-coordinator-boss v10) reads `team/`, `properties/`, `queries/` via direct file access — NOT via `gbrain search` — so the seed's `facts/` pages are still consumable. `test-ingest-e2e.sh`'s Stage 5 citation proof uses the same direct-file-read pattern. v0.1.x does not depend on runtime `gbrain search` until the upstream WASM bug is fixed. ^o-gbrain-wasm-bug
- Pagination overlap edge case: Hostex's listing is offset-paginated and sorted by `last_message_at` desc. If a new conversation arrives during a walk, subsequent offset positions shift by one. The next incremental refresh will catch the missed conversation (its `last_message_at` will be greater than the post-walk watermark). For long-running operators the practical effect is "small windows where a brand-new conversation is delayed by one refresh cycle." Not a data-loss bug. ^o-pagination-overlap (codex P1d)
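The pagination overlap case is easier to see with a toy simulation. The Python below is purely illustrative and uses made-up IDs: it walks a newest-first list by offset while one new conversation arrives after a chosen page, showing that an older entry is seen twice and the new one is not seen at all in that walk.

```python
from typing import List


def walk_with_arrival(
    listing: List[str],
    page_size: int,
    arrival_after_page: int,
    new_id: str,
) -> List[str]:
    """Walk an offset-paginated, newest-first listing while one item arrives mid-walk."""
    listing = list(listing)
    seen: List[str] = []
    offset, page = 0, 0
    while True:
        batch = listing[offset:offset + page_size]
        if not batch:
            return seen
        seen.extend(batch)
        offset += page_size
        page += 1
        if page == arrival_after_page:
            listing.insert(0, new_id)  # the newest conversation lands at the top


seen = walk_with_arrival(["c5", "c4", "c3", "c2", "c1"], page_size=2, arrival_after_page=1, new_id="c6")
print(seen)  # ['c5', 'c4', 'c4', 'c3', 'c2', 'c1']
```

The duplicate `c4` is harmless because merges are idempotent, and `c6` is picked up on the next refresh since its `last_message_at` is later than the stored watermark.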
## Non-goals
- Real-time webhook sync — already handled by `seed-hermes-airbnb-manager` boss skill for the live-message path. This seed is the COLD HISTORY layer.
- Escalation surface — already handled by `seed-hermes-airbnb-manager`'s `^act-courier-per-ask` escalate path. This seed just makes the brain richer; escalation is downstream.
- Brain page editing/deletion UI — operator edits via direct file write + git commit. This seed only appends.
- Multi-Hostex-account support — single-token deployment in v0.1.0.
- Read-side query helper — `seed-hermes-airbnb-manager`'s boss skill already calls `gbrain search`. This seed only writes.
- Owner dashboard view of facts — `seed-plow-airbnb-dashboard` territory.
- Embedding key management — inherited from `seed-hermes-gbrain` `.gbrain-sync.env`.
- Re-distillation of v1 pages on a v2 distiller — deferred to v0.2.0 (`o-redistill`).
- Stratified-sample bulk ingest — operator can run `--limit N` for most-recent-N; smarter sampling out of scope (`o-stratified-sample`).
- Outside voice via `/codex review` of this PLAN — explicitly skipped per CEO "speed > polish"; codex review of the IMPL runs at Gate 3.
# seed-hostex-history-ingest
> Backfill the operator's Hermes brain with **distilled facts from every historical Hostex conversation across every property**, then keep it current via a sidecar cron, so the airbnb-coordinator boss skill can `gbrain search "<topic> <property>"` and reply from real history — escalating to the owner only when search returns nothing.
## Purpose
This seed produces a new brain-page class (`facts/`) sitting alongside the existing `team/`, `properties/`, `queries/` page classes defined by `seed-hermes-airbnb-manager`. It **writes**; it does not read.
The data flow:
```
Hostex API ──► initial-ingest.sh (one-shot)          ┐
               + incremental-refresh.sh (daily cron) ┼─► hostex-distiller profile ─► JSON facts ─► /opt/data/home/brain/facts/<prop>/<topic>.md
               (state.json watermark + idempotency)  ┘                                              │
                                                                                                    │
                                               ┌─────────────── gbrain-sync (5 min) ────────────────┘
                                               ▼
                                         gbrain index
                                               ▲
                                               │
seed-hermes-airbnb-manager boss skill ─► gbrain search "<topic> <prop>" ─► cite verbatim OR escalate
```
## Quick start
Requires `seed-hermes` + `seed-hermes-gbrain` already running in a scaffold (`./hermes-agent` by default).
```sh
# 1. Install
./ref/scripts/install_hostex_ingest_into_compose.sh \
  --scaffold ./hermes-agent \
  --hostex-token "$HOSTEX_ACCESS_TOKEN"

# 2. Start the sidecar
cd hermes-agent && docker compose up -d

# 3. Bulk-ingest history (one-shot; ~30 min for ~350 conversations at gpt-5.5 default)
docker compose exec -T -u 501:20 -e HOME=/opt/data/home hermes \
  bash -lc '/opt/data/home/hostex-ingest/initial-ingest.sh'

# 4. Verify (step 2 changed into the scaffold, so return to the seed checkout first)
cd .. && ./ref/verify.sh --scaffold ./hermes-agent
```
After step 4 passes, the sidecar runs incremental refreshes every 24h (override via `HOSTEX_INGEST_INTERVAL`). New conversations and new messages in existing conversations get distilled into fact pages automatically.
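One way to pick up only new messages on each tick is a timestamp watermark. The snippet below is a minimal sketch of that idea, not the seed's actual `incremental-refresh.sh`: the function name `messages_after_watermark` and the tab-separated `<message_id><TAB><created_at>` input format are assumptions made for illustration. It relies on ISO-8601 UTC timestamps of identical shape sorting correctly as plain strings.

```sh
#!/usr/bin/env bash
# Hypothetical helper (not part of the seed): print the IDs of messages
# that are newer than a stored watermark.
# stdin: one "<message_id><TAB><created_at>" record per line (assumed format).
messages_after_watermark() {
  local watermark="$1"
  # Appending "" forces a string comparison in awk; uniform ISO-8601 UTC
  # timestamps order correctly when compared as strings.
  awk -F '\t' -v wm="$watermark" '($2 "") > (wm "") { print $1 }'
}
```

With an empty watermark every record passes, which corresponds to a first run with no stored state.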
## How facts get written
Each fact page lives at `/opt/data/home/brain/facts/<property-slug>/<topic-slug>.md` (or `facts/general/<topic-slug>.md` for cross-property facts). Pages merge per `(property, topic)` pair: when a new conversation covers a topic that already has a page, the ingest appends the new source IDs and re-renders the body.
Example output:
```markdown
---
title: "Mtn Home — Wifi"
topic_slug: wifi
property_id: 12051776
property_slug: mtn-home
channel_types:
- "airbnb"
- "vrbo"
source_conversation_ids:
- "0-2522317621"
- "0-2444094028"
source_message_ids:
- "msg-abc"
- "msg-def"
first_seen_at: "2026-05-25T20:00:00Z"
last_seen_at: "2026-05-25T20:00:00Z"
confidence: high
ingest_version: 1
---
## Fact
Wifi network is MtnHomeGuest with password t1gers_W3lcome. Available throughout the cabin including the dock area.
## Sources
- 0-2522317621/msg-def @ 2026-05-23T19:05:00Z
- 0-2444094028/msg-abc @ 2026-05-12T14:22:00Z
```
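The path layout and the per-`(property, topic)` merge can be sketched in a few lines of shell. This is an illustration under assumptions, not the code in `ingest-lib.sh`: the helper names `fact_page_path` and `merge_source_ids` are hypothetical, and the merge shown here only covers de-duplicating source IDs, not re-rendering the body.

```sh
#!/usr/bin/env bash
# Brain root taken from the page path documented above; overridable for testing.
BRAIN_FACTS_ROOT="${BRAIN_FACTS_ROOT:-/opt/data/home/brain/facts}"

# Hypothetical helper: resolve the page path for a (property, topic) pair.
# An empty property slug maps to the cross-property "general" directory.
fact_page_path() {
  local property_slug="${1:-general}" topic_slug="$2"
  printf '%s/%s/%s.md\n' "$BRAIN_FACTS_ROOT" "$property_slug" "$topic_slug"
}

# Hypothetical helper: append incoming source IDs to the existing ones,
# dropping blanks and duplicates while keeping first-seen order.
# Both arguments are newline-separated lists.
merge_source_ids() {
  local existing="$1" incoming="$2"
  printf '%s\n%s\n' "$existing" "$incoming" | awk 'NF && !seen[$0]++'
}
```

For example, `fact_page_path mtn-home wifi` resolves to the page used in the example output, and merging a list with itself leaves it unchanged, which is the property that makes re-ingesting a conversation harmless.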
## Configuration knobs
All in `<scaffold>/data/.hostex-ingest.env` (mode 600, sidecar-only):
| Env | Default | Purpose |
|---|---|---|
| `HOSTEX_ACCESS_TOKEN` | (required) | Hostex API token |
| `HOSTEX_BASE_URL` | `https://api.hostex.io` | Override for staging / local fake-Hostex |
| `HOSTEX_INGEST_INTERVAL` | `86400` (24h) | Cron tick interval. Min recommended: 3600 |
| `HOSTEX_INGEST_INTERVAL_INITIAL_SLEEP` | `60` | Pre-first-tick sleep so install doesn't hammer Hostex |
| `HOSTEX_INGEST_DISTILLER_PROFILE` | `hostex-distiller` | Profile name |
| `HOSTEX_INGEST_DISTILLER_MODEL` | (= scaffold's `model.default`) | Override to cheaper model for bulk distillation |
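
As a sketch of how a loop script might consume `HOSTEX_INGEST_INTERVAL`, the helper below applies the documented default and warns below the recommended minimum. The function name `resolve_ingest_interval` and the fall-back-on-invalid-input policy are assumptions for illustration; the seed's `cron-loop.sh` may behave differently.

```sh
#!/usr/bin/env bash
# Hypothetical helper (not part of the seed): resolve the tick interval in seconds.
resolve_ingest_interval() {
  local raw="${HOSTEX_INGEST_INTERVAL:-86400}"   # documented default: 24h
  case "$raw" in
    ''|0*|*[!0-9]*)
      # Assumed policy: anything that is not a positive integer falls back to the default.
      echo "invalid HOSTEX_INGEST_INTERVAL='$raw'; using 86400" >&2
      raw=86400
      ;;
  esac
  if [ "$raw" -lt 3600 ]; then
    # Below the recommended minimum from the table: warn, but honor the value.
    echo "HOSTEX_INGEST_INTERVAL=$raw is below the recommended minimum of 3600" >&2
  fi
  printf '%s\n' "$raw"
}
```

A loop could then call `sleep "$(resolve_ingest_interval)"` between ticks.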
## PII protection (defense in depth)
The distiller SOUL.md instructs the model to redact PII. A post-distillation regex sweep then drops any fact whose body contains phone-like patterns (`\+\d{10,}`), email patterns, or any guest name from the input transcript. Blocked facts log to `state.json.pii_blocks[]` for operator review.
Verify V7 asserts that no page under `facts/` matches PII regexes after a bulk ingest.
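The post-distillation sweep can be approximated with `grep`. This is a minimal sketch, not the seed's implementation: the function name `fact_contains_pii` is hypothetical, the email pattern is a simplified stand-in because the seed's exact regex is not shown here, and `\d` is written as `[0-9]` because POSIX extended regular expressions have no `\d`.

```sh
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when a fact body should be blocked, 1 otherwise.
# $1 = fact body; remaining arguments = guest names from the input transcript.
fact_contains_pii() {
  local body="$1" name
  shift
  # Phone-like pattern from the text above: "+" followed by 10 or more digits.
  printf '%s\n' "$body" | grep -Eq '\+[0-9]{10,}' && return 0
  # Simplified email pattern (an assumption, see lead-in).
  printf '%s\n' "$body" | grep -Eq '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' && return 0
  # Case-insensitive fixed-string match on each guest name.
  for name in "$@"; do
    [ -n "$name" ] && printf '%s\n' "$body" | grep -Fqi -- "$name" && return 0
  done
  return 1
}
```

A caller would skip writing the page and record the block when the function succeeds, for example `fact_contains_pii "$body" "$guest_name" && echo "blocked"`.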
## Failure modes (the ones we've actually seen)
**Symptom: `BLOCKED: gbrain not on container PATH`**
- Detect: `docker compose exec hermes which gbrain` fails
- Fix: install `seed-hermes-gbrain` first. This seed is a chain link, not a bootstrap.
**Symptom: `hermes chat` exits to setup wizard ("no API keys or providers found") for `hostex-distiller`**
- Detect: `cat <scaffold>/data/profiles/hostex-distiller/config.yaml` lacks a `model:` block
- Fix: re-run the installer. It calls `mirror_model_block.sh` to copy the scaffold's `model:` into the profile.
**Symptom: distiller returns prose around the JSON (model didn't follow SOUL output contract)**
- Detect: `state.json.distill_failures[]` accumulates entries
- Fix: a) check the distiller profile is loading the current SKILL.md (`rm <scaffold>/data/profiles/hostex-distiller/.skills_prompt_snapshot.json` to force reload). b) switch to a stronger model via `HOSTEX_INGEST_DISTILLER_MODEL` and re-run.
**Symptom: `EXIT_LOCKED=75` from `initial-ingest.sh`**
- Detect: exit code 75, log line "state lock held; exiting EXIT_LOCKED=75"
- Fix: the cron sidecar is mid-tick. Wait or `docker compose stop hostex-ingest-cron` first, run bulk, then `start` it again.
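
The exit-on-held-lock behaviour matches a common non-blocking `flock` pattern (`flock` is among the tools the container is required to provide). The sketch below shows that pattern under assumptions: the lock file path and the function name `run_with_state_lock` are hypothetical, and the seed's real locking code is not reproduced here.

```sh
#!/usr/bin/env bash
EXIT_LOCKED=75
# Hypothetical lock path; the seed's actual lock file location is not shown here.
LOCK_FILE="${LOCK_FILE:-/tmp/hostex-ingest-demo.lock}"

# Run a command while holding an exclusive lock on LOCK_FILE.
# If another process already holds the lock, return EXIT_LOCKED without waiting.
run_with_state_lock() {
  (
    # -n: fail immediately instead of blocking when the lock is taken.
    flock -n 9 || {
      echo "state lock held; exiting EXIT_LOCKED=${EXIT_LOCKED}" >&2
      exit "$EXIT_LOCKED"
    }
    "$@"
  ) 9>"$LOCK_FILE"
}
```

Because the lock is tied to file descriptor 9, it is released automatically when the subshell exits, including on failure of the wrapped command.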
**Symptom: HTTP 429 from Hostex during bulk**
- Detect: `state.json.api_failures[]` populated; logs show "backing off"
- Fix: ingest retries with exponential backoff 2s/4s/8s × 3. Persistent failures retry on next incremental tick.
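
The backoff schedule can be pictured as follows. This is a sketch under one reading of "2s/4s/8s × 3", namely one initial attempt followed by up to three retries after 2, 4, and 8 seconds; the function name `retry_with_backoff` and the overridable `SLEEP_CMD` are hypothetical and not taken from the seed.

```sh
#!/usr/bin/env bash
# SLEEP_CMD is overridable so the sketch can be exercised without real waiting.
SLEEP_CMD="${SLEEP_CMD:-sleep}"

# Hypothetical helper: run a command, retrying after 2s, 4s, and 8s on failure.
retry_with_backoff() {
  local delay
  for delay in 2 4 8; do
    "$@" && return 0
    echo "backing off ${delay}s" >&2
    "$SLEEP_CMD" "$delay"
  done
  # Final attempt after the last delay; its exit status goes back to the caller.
  "$@"
}
```

Usage would look like `retry_with_backoff curl -fsS "$url"`, where a non-zero final status is the signal to record the failure and leave the work for the next tick.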
## Substrate chain
```
seed-hermes (Docker scaffold)
  └─► seed-hermes-gbrain (brain repo, sync sidecar, embeddings)
        ├─► seed-hermes-airbnb-manager (boss + listener skills + courier sidecar)
        └─► seed-hostex-history-ingest [THIS SEED] (history → facts/)
```
Either `airbnb-manager` or `hostex-history-ingest` can be installed independently; neither depends on the other. The load-bearing value of this seed, however, lands only when both are running: history-ingest fills the brain, and airbnb-manager's boss skill cites it.
## Layout
```
seed-hostex-history-ingest/
├── SEED.md                     # the RFC 2119 contract
├── README.md                   # this file
├── LICENSE
├── CHANGELOG.md
├── justfile                    # just install / just verify / just uninstall
├── ref/
│   ├── scripts/
│   │   ├── install_hostex_ingest_into_compose.sh   # canonical installer
│   │   ├── uninstall.sh
│   │   └── lib/mirror_model_block.sh               # YAML model: block mirror
│   ├── container/hostex-ingest/
│   │   ├── ingest-lib.sh           # shared functions
│   │   ├── initial-ingest.sh       # bulk one-shot
│   │   ├── incremental-refresh.sh  # sidecar payload
│   │   ├── cron-loop.sh            # sidecar entrypoint
│   │   └── state-schema.json       # state.json schema
│   ├── hermes-soul/distiller-SOUL.md
│   ├── hermes-skills/distill-conversation/SKILL.md
│   ├── compose/compose.hostex-ingest.yaml
│   └── verify.sh
└── tests/
    ├── fixtures/{conversation-detail,conversations-list,distiller-output.golden}.json
    └── test_*.sh               # unit tests for ingest-lib functions
```
## License
MIT.
Version history
1 release: Initial Seed Release.
Dependencies
7 required · 3 optional
- Seed: seed-hermes (required). SEED.md: 'MUST be installed and a <scaffold>/data/ bind-mount MUST be present'
- Seed: seed-hermes-gbrain (required). SEED.md: 'MUST be installed in the same scaffold; this seed writes pages directly into gbrain'
- Seed: seed-hermes-airbnb-manager (optional). SEED.md: 'SHOULD be installed for the load-bearing end-to-end Verify (V9)'; seed works without it but V9 regression gate requires it
- SW: Docker + Compose (required). SEED.md: 'The host MUST have Docker with Compose support and a seed-hermes scaffold prepared'
- API: Hostex API (required). SEED.md: 'container MUST have outbound network access to https://api.hostex.io'
- State: HOSTEX_ACCESS_TOKEN (required). SEED.md: 'REQUIRED. Hostex API token (sent as Hostex-Access-Token header). Obtained from operator Hostex settings → API page. Mode 600 in <scaffold>/data/.hostex-ingest.env'
- SW: bash, curl, git, flock, python3 (required). SEED.md: 'container MUST have bash, curl, git, flock, python3. The official nousresearch/hermes-agent image (Debian 13 trixie) supplies these five'
- SW: jq (required). SEED.md: 'jq is NOT in the base image and is installed by the installer at install time (apt-get)'
- State: HOSTEX_BASE_URL (optional). SEED.md: 'OPTIONAL, default https://api.hostex.io. Override for staging / local fake-Hostex'
- State: HOSTEX_INGEST_DISTILLER_MODEL (optional). SEED.md: 'OPTIONAL, default = scaffold model.default. Override to a cheaper model for distillation'
Similar seeds
- Turns your short-term rental AI assistant into a full team coordinator, routing guest questions to staff over iMessage and pinging you only to approve the final reply. By @plow-pbc. Directly paired seed that turns Hostex conversations into a coordinated short-term rental AI assistant using the same Hermes agent scaffold.
- Give your Hermes AI agent a persistent, searchable knowledge graph it can read and write without any image rebuild. By @plow-pbc. Provides the persistent knowledge-graph brain layer that stores and retrieves the ingested Hostex conversation facts for the AI assistant.
- Run the Hermes AI agent locally in Docker with a browser dashboard, ChatGPT login, and files you can edit directly from your computer. By @plow-pbc. Runs the foundational Hermes AI agent locally in Docker, which is the base platform this seed builds upon.
- Connect your Hermes AI agent to iMessage/SMS by wiring it to the Plow Chat API via a direct-mounted gateway plugin. By @plow-pbc. Extends Hermes to additional messaging channels, relevant if you want guest conversations from multiple platforms ingested into the knowledge base.