Held Accountable

Minimum qualifications:

3+ years in software, data, or applied ML (or equivalent practical experience).
Strong Python or TypeScript; comfort with APIs/SDKs and JSON schemas.
Hands-on experience with modern LLMs, embeddings, vector stores, and evaluation techniques.
Experience designing experiments with clear acceptance criteria for model behavior.
Excellent written communication and a safety- and reliability-first mindset.

Preferred qualifications:

RAG at scale (indexing pipelines, hybrid search, metadata filtering).
Fine-tuning/adapters and advanced tool/function calling.
MLOps/observability for LLMs (prompt/version telemetry, tracing, cost controls).
Domain experience in support automation, knowledge search, or productivity tools.
Knowledge of data privacy, safety guardrails, and prompt injection mitigations.

About the role

We’re looking for an AI Prompt Engineer to design, test, and scale high-quality prompts and prompt-driven workflows across our products. You’ll sit at the intersection of product, applied ML, and engineering, turning fuzzy user needs into reliable LLM behaviors and then hardening those behaviors for production.

How we work

Remote-friendly with flexible collaboration hours. We value clear writing, measurable outcomes, and safety-first design. You’ll partner closely with PM, Design, and Engineering to get real features into customers’ hands.

What success looks like (first 90 days)

Reduce the hallucination rate on a target task by ≥30% through prompt/eval iteration.
Launch one production prompt flow with <2% failure rate and a clear rollback plan.
Publish a prompt style guide and decision log adopted by partner teams.

How to apply

Send your resume/portfolio (prompt samples and eval reports welcome) to contact@held-accountable.com with the subject line "AI Prompt Engineer – [Your Name]".

Responsibilities

Design and iterate prompts (zero-/few-shot, tool/function calling, JSON mode) for Q&A, summarization, extraction, and agentic workflows.
Build evaluation harnesses (automatic + human-in-the-loop) to measure quality, hallucinations, latency, and cost.
Create prompt chains/graphs and retrieval-augmented generation (RAG) pipelines; tune chunking, metadata, and citations.
Partner with PM/Design to translate requirements into LLM specs: context windows, grounding data, fallbacks, and guardrails.
Instrument A/B tests and track metrics: answer quality, deflection/containment, CSAT, and unit cost.
Productionize flows with versioning, observability/tracing, retries/rate limits, safety filters, and incident playbooks.
Maintain a prompt library with documentation, reusable patterns, and red-teaming checklists.