About Blog Contact Links Vault
Latest
Home / Test Automation & AI-Driven Testing / I Built a One-Person AI QA Agency Using a Skill File and a Local LLM. Here Is How It Works.
Test Automation & AI-Driven Testing
25 min read · May 28, 2026 · 30 views

I Built a One-Person AI QA Agency Using a Skill File and a Local LLM. Here Is How It Works.

Most QA engineers use AI the way they use Google: reactive, stateless, no retained context. This is the system I built instead, a skill file that loads my methodology into every session, a local LLM that keeps client data off third-party servers, and a clear line between what AI executes and what only a human can call.

Share:

The manual vs automation debate has been running since the first test runner shipped, and it still has not produced a useful answer for anyone working alone on a real engagement. Not because the question is wrong exactly, but because it is pointed at the wrong problem. The actual gap is not which approach to use. It is whether you have a system that makes your judgment repeatable, deployable, and consistent across every session, every client, and every AI dev that codes faster than any human reviewer can keep up with.

This is the AI QA workflow for real projects that I built for exactly that situation. A skill file that carries my methodology into every session. A local LLM that runs it without a subscription, without sending client data to a third-party API, and without starting from zero every time I open a new chat. And me in the seat where a human actually has to sit: reading the product as a user reads it, making severity calls, catching what the machine coded correctly but built wrong.

The Manual vs Automation Question Was Never the Right One

The debate frames the problem as a choice between two approaches. Manual testing over here, automation over there, and the question is which one you should be doing or which one you should be learning. That framing has produced a lot of career anxiety and not much practical guidance, because the real answer has always been both, and neither one matters without the judgment to know when to use which.

Manual testing is not a stepping stone to automation. It is the foundation that makes automation meaningful. You cannot write a useful automated test for a flow you do not understand deeply enough to break manually first. You cannot triage a regression failure correctly if you have never watched a real user hit that surface and behave unexpectedly. The value of manual testing is not in the clicking. It is in the pattern recognition that builds up over time from the clicking, and that pattern recognition is exactly what automation executes against once you have it.

The question worth asking is not manual or automation. It is whether your judgment is strong enough to know what to test, why it matters, when to automate it, and where to stay human. That is the skill. Everything else is tooling. Getting that right is what allows a single QA engineer to operate at a level that used to require a full team, and it is what makes AI a force multiplier instead of a replacement.

What Actually Breaks When AI Does the Building

There is a specific failure mode that shows up consistently when AI agents are doing the development work, and it is not what most people expect. The bugs are not always obvious. The code is often correct. The acceptance criteria gets met. And the feature is still broken in a way that matters.

I was testing a billing feature recently. The AI dev coded it properly by every technical measure. Fields worked. Calculations ran. Data saved. Every acceptance criterion in the ticket was satisfied. What was missing was a way to view billing history. There was no option to check whether a charge was recurring. A user looking at that screen for the first time would have no idea what had been billed, when, or why. The feature worked. It was not usable. That is not a code problem. It is a product logic problem, and it only surfaces when a human reads the screen the way a real user would read it, not the way a developer audits a diff.

This is the judgment gap that no AI testing tool closes, and it is not a capability gap that more compute will eventually solve. As one source tracking this space put it, AI perceives structure but not experience. It can verify that buttons respond, forms accept input, and navigation links reach their destinations. It cannot evaluate whether the flow makes sense. Whether the error message helps the user understand what to do next. Whether two technically distinct states are visually indistinguishable in a way that will cause real users to panic. That reading of the product, that translation of technical correctness into user experience reality, is the work that requires a human who understands what the product is supposed to feel like to use.

The manual vs automation framing misses this entirely because it treats testing as a binary of approach. The real binary is between testing that catches what the machine built and testing that catches what the machine could not have known to build. AI handles the first category well. The second category is still mine.

How I Turned My Methodology Into a Deployable System

The problem with using AI as a QA assistant the way most people use it is statefulness. Every new chat is a blank slate. You explain the context, paste the ticket, describe the feature, and by the time you have set up enough background for the AI to be useful, you have already done half the thinking yourself. That loop works once. It does not work across a full engagement, across multiple clients, or across the kind of session-to-session continuity that real QA work requires.

The fix is a skill file. A single document that carries my complete QA methodology into every session: test surface tiers, happy and sad and real-world chaos paths, test case formats for lightweight and detailed coverage, bug report structure, severity and priority logic, security test cases for standard web surfaces, Playwright conventions, and a defined role for AI in the workflow. Not a prompt I paste every time. A loaded context that the AI operates inside from the first message.

The skill file defines how I actually work. Tier 1 surfaces are revenue, auth, and data integrity, tested every cycle and automated first. Tier 2 is core feature correctness and key user flows. Tier 3 is polish, edge cases, and regression candidates. Before a single test case is written, I map the surface against those tiers and establish the happy path, the sad path, and the chaos path for each. The chaos path is the one most QA coverage skips, and it is the one real users find fastest. What happens when someone hits submit twice because the page is loading slowly? What happens when they paste a password into a username field? What happens when they interrupt a multi-step flow, switch devices, and come back? Test for the users you actually have, not the ones the spec assumed.

The skill file also defines what AI does and does not do in this workflow. AI is a domain explainer, a test surface expander, an output accelerator, and a thinking partner for severity triage. It drafts bug reports, generates test cases, scaffolds Playwright scripts, and expands edge case coverage from a surface I have already mapped. It does not override severity calls. It does not suggest a bug might not be a bug without being asked. It does not defend output when I push back. That division of labor is not a limitation. It is the design. If you want to understand how this plays out across a specific engagement, the real-world AI QA workflow I documented from a casino game project covers the collaboration model in practice.

Why Local LLMs Change the Equation for Solo QA

Running the skill file through a cloud API works. Running it through a local LLM works better for the specific conditions most solo QA engineers are actually operating in, and the gap between local and cloud quality has closed significantly enough in 2026 that the tradeoff is worth understanding.

The case for local starts with data. When I am working on a client engagement, the tickets contain real product logic, real feature descriptions, and real bug details. Sending that context to a third-party API on every session means that data is leaving the local machine and passing through infrastructure I do not control. For most freelance and retainer QA work, that is a compliance question at minimum and a client trust question always. A local model running on my own hardware keeps the engagement data where it belongs.

The second case is cost at volume. A QA workflow that uses AI as a genuine session-level tool, not a one-off prompt helper, accumulates a lot of tokens. Running that against a paid API every day, across multiple clients, adds up in a way that local inference does not. Tools like Ollama make local deployment essentially frictionless for anyone comfortable with a terminal. It behaves like infrastructure: start the server, hit the endpoint, get completions, same API shape as the cloud tools you are already familiar with. For QA tasks, 7B to 14B parameter models hit the right balance between hardware requirements and output quality. You do not need enterprise hardware. You need a machine that can run quantized inference, which in 2026 means most reasonably modern consumer setups qualify.

The deeper value is that local LLMs become part of the system rather than a service you query. When the skill file is the system prompt and the local model is the runtime, what you have is not a chatbot. It is a QA assistant that operates inside your methodology on every session, on client data that never leaves your machine, at zero marginal cost per token. That is a different tool than a chat interface with a paid subscription. For more on how local LLMs fit into an actual workflow stack, the Ollama and LiteLLM setup guide on EngineeredAI covers the infrastructure layer in detail.

What the AI QA Workflow for Real Projects Actually Looks Like

The workflow has three roles. Me as the judgment layer. A cloud AI (Claude) loaded with the skill file for complex reasoning, domain explanation, and output generation during active sessions. A local LLM running the same methodology for lightweight tasks, quick coverage checks, and anything involving client data that should stay on the machine. The skill file is the constant across all three.

What this looks like in practice is less heroic than the “AI QA team” framing suggests and more useful than anything I was doing before it existed. A ticket comes in. I read it as a user first, not as a tester. I map the surface against the tiers. I open a session with the AI loaded with my methodology and I ask it to expand the coverage surface: what am I missing on the chaos path, what are the security-adjacent surfaces on this feature, what mutations should I be watching for on adjacent components after this fix. The AI generates a surface I can react to and correct. Reacting is faster than generating from scratch, and every correction is my QA judgment doing the actual work.

The AI dev part of the loop has its own discipline. Working with an AI developer is not like working with a human developer who will ask clarifying questions when a ticket is ambiguous. The ticket is the full brief. If the scope boundary is not explicit, the agent will interpret, and its interpretation may produce a technically correct fix that missed the intent. I learned this the hard way when a scope expansion I added as a comment in a ticket was ignored because the agent treated the original ticket description as the full instruction set. The comment was not the brief. The brief was the brief. That is not a failure of the tool. It is a communication requirement I had not adapted to yet.

The PR loop is worth understanding correctly because it is easy to misread. On a human dev team, a PR left open is intentional. It is waiting for peer review, a senior approval, or a second set of eyes before it merges to staging. If the team has GitHub Actions or a cloud deploy pipeline wired up, opening the PR automatically generates a preview link. That preview link is the first test environment. QA tests there before the PR is approved, catches issues before the code ever touches staging, and then tests again after the merge for the smoke tests and sanity checks. Two test points in the deploy chain, both deliberate.

On an AI dev team the PR left open means something different. There is no peer reviewer waiting to approve it. Nobody is going to merge it unless you say so. The agent completed the work, opened the PR, and stopped. If you are waiting on staging to reflect the fix and nothing has changed, the PR is probably sitting open with the work done and no instruction to merge. You have to say it explicitly. That is not a flaw in the tool. It is the missing human step in the review chain, and once you understand it that way you stop being surprised by it and start building it into how you close tickets. Check the PR status before you declare a fix unverified. If the work is done and the PR is open, merge it, wait for the deploy, then run the verification pass on staging. The hybrid AI QA workflow post goes deeper on the full direct fix protocol and what clean ticket closure looks like in this loop.

What the AI Dev Team Already Tests (And Why That Is Not Your Job)

Something that does not get discussed enough when people talk about QA on AI dev teams is that the AI dev team already has a QA layer. It is not optional and it is not manual. The agent runs a linter on every change. Unit tests execute automatically. In most setups I have worked in, the agent generates its own Playwright scripts and runs them as part of the build before a PR is even opened. GitHub Actions triggers the full suite. The obvious paths are tested before you ever see the ticket move to review.

This matters because it reframes what your Playwright work is actually for. You are not filling a gap the agent left. You are operating above the layer the agent already covered. The agent tests what it built against what it intended to build. Those two things are aligned by design in its own test suite, which is exactly why that suite passes and the feature still has problems. Your Playwright scripts test what it built against what a real user would expect to experience, and those two things are frequently not the same.

The distinction between what is worth automating and what is a one-off is also something that only makes sense if you have tested manually first. There is no point automating a scenario that only happens under a specific set of conditions you set up once to reproduce a bug. That is a one-off. You document it, you file it, you verify the fix, and you move on. What you automate is the flow that will drift across releases, the surface that the agent touches repeatedly because it sits under multiple features, the regression risk that is predictable because you have seen it happen before. You can only see the difference between those two categories because you have sat with the product and broken things manually. If you skip the manual layer, you automate the wrong things, and you end up with a test suite that passes consistently and catches nothing that matters. The manual testing skills that make automation better post is the right read if you want to understand why this sequence matters before you write a single script.

The system handles a lot. Coverage mapping, test case generation, bug report scaffolding, edge case expansion, Playwright scaffolds, security surface checklists, session logging. The AI does all of that faster and more consistently than I could manage manually across a full engagement day. What the system does not do is read the product as a user reads it, and that is not a limitation I am trying to fix.

Copy drift is a consistent miss. An AI dev fixes a label in a component and does not check whether that label appears in onboarding copy, tooltip text, or error messages that still reference the old term. The code is correct. The product is inconsistent in a way that erodes trust with real users. That check does not happen unless I run it, because it requires reading the product as a whole, not auditing a diff. Context failures are quieter and more consequential. Two states that are technically distinct can be visually indistinguishable in a way that will cause a real user to make the wrong decision. No test suite catches that. A human using the product with user intent catches it.

The judgment gap is not closing because it is not a capability problem. It is a human understanding problem. Deciding what to test first with a constrained timeline involves knowing which surfaces are historically fragile, which flows are emotionally high-stakes for the users of this specific product, and which AC items represent minimum viable correctness versus genuine user experience quality. That is not something you prompt your way into. It is what you bring to the session, and the system is built around it. AI runs inside the judgment. The judgment is not replaceable by the AI. Building the system made that clearer, not murkier, and that clarity is what makes the whole thing work as a sustainable one-person operation rather than a constant scramble to keep up.

The AI QA structured system post covers how to structure context packages and risk maps if you want to extend this further into a full session-based workflow.

// Full Version QA Work Skill: Full Tiers · Three paths · Bug format · Security · Playwright · Session log · AI role
# QA Work Skill — Full Version

A loadable methodology file for AI-assisted QA work. Drop this into your AI tool of choice as a system prompt. Works with Claude, ChatGPT, Gemini, local LLMs via Ollama, or any chat interface that accepts a system context.

This is not a prompt. It is a context file that your AI assistant operates inside from the first message. Load it once per session and your AI has your methodology, your formats, and your rules without you re-explaining them every time.

---

## Who You Are as a QA

Fill in your own details here. The more specific, the better the AI can calibrate its output.

Role: [QA Engineer / SDET / Freelance QA / QA Lead]
Experience level: [e.g. 5 years mixed background, 2 years formal QA]
Stack:
  - E2E / UI automation: [Playwright / Cypress / Selenium / other]
  - API testing: [Postman / Playwright request context / other]
  - Performance: [k6 / JMeter / other]
  - Security: [Burp Suite / OWASP ZAP / other]
  - Language: [JavaScript / Python / Java / other]
Current engagement: [Brief description of what you are testing]

Core position: Testing is not checking boxes. Checking boxes is executing steps someone else wrote and marking pass/fail. Testing is asking whether the right things were even on the list in the first place. It is judgment work.

AI working model: Give the AI context. Get output. Review, correct, move. Every correction is your QA instinct working. The AI generates something to react to. Reacting is faster than generating from scratch. The AI works like a junior tester: given context and structure, output reviewed, calls made by you.

---

## Test Strategy

Map the test surface in tiers before writing a single test case. Methodologies do not fail. Weak test cases do.

Tier 1 — Critical: Revenue, auth, data integrity. Tested every cycle. Automated first.
Tier 2 — High: Core feature correctness, key user flows. Rotated across cycles.
Tier 3 — Medium/Low: Polish, regression, edge cases. Manual exploratory, automated over time.

Before writing test cases, establish:
1. What does this feature do? What is the spec or acceptance criteria?
2. What are the happy paths, including error recovery?
3. What are the sad paths, including graceful degradation?
4. What are the real-user chaos paths: unexpected inputs, interrupted flows, boundary violations?
5. What are the security-relevant surfaces?

Test case audit checklist:
- Missing negative paths
- Missing boundary conditions
- Duplicate coverage
- Untested user story branches
- Expected results that say "it works" instead of describing observable behavior

---

## Happy / Sad / Real-World Path Testing

Standard happy/sad framing is incomplete. Use all three.

Happy path: Not just valid inputs and expected outputs. A truly happy path includes intelligent error recovery, clear boundary feedback, and security that does not punish legitimate users. The system should help users succeed even when they make predictable mistakes.

Sad path: Not just does the system detect the error. Does it handle the error without destroying the user experience? Catching failures is half the job. Recovering cleanly is the other half. Error messages should tell users what to do next, not just that something went wrong.

Real-world chaos path: Real users do not follow scripts. They paste passwords into username fields. They hit submit multiple times because the page is loading slowly. They use emoji in fields because it worked somewhere else. They switch devices mid-flow. They interrupt and come back. Test for the users you actually have, not the ones you wish you had.

Questions to generate test cases for each path:
- Happy: Does the system guide users through predictable mistakes?
- Sad: Is the error recovery path actually tested, or just the error detection?
- Chaos: What happens when state is broken mid-flow? When inputs are boundary-violating? When the same action fires twice in quick succession?

---

## Test Case Format

Match documentation depth to what the situation needs.

Use lightweight when:
- Exploratory sessions where direction is needed, not scripts
- Simple features with obvious flows
- Fast-moving projects with frequent requirement changes
- Deep product familiarity where the case is a trigger, not a manual

Use detailed when:
- Complex multi-step workflows with state dependencies
- Critical path regression where consistent execution across releases matters
- Anything touching billing, auth, or financial logic
- Handoff to another tester or client-facing documentation

Lightweight format:
TC-[NUMBER]: [What is being validated]
Steps:
1. [Key action]
2. [Key action]
Expected: [Observable pass condition]
Tags: [smoke | regression | exploratory | security | performance]

Detailed format:
TC-[NUMBER]: [Descriptive title]
Preconditions:
- [Account state, data setup, prior steps required]
Steps:
1. [Exact action]
2. [Exact action]
3. [Exact action]
Expected Result: [Specific, observable pass condition]
Actual Result: [Filled during execution]
Test Data: [Accounts, input values, specific amounts]
Priority: [High / Medium / Low]
Tags: [smoke | regression | exploratory | security | performance]
Dependencies: [TC-XXX if prerequisite]

Hard rule: Expected results must be specific and observable. "It works" is not an expected result.

---

## Bug Report Format

Title: [Component] Short description of failure

Environment:
- URL / screen / feature
- Browser + version / device / OS
- Account type / role / plan tier if relevant
- Reproduction date

Steps to Reproduce:
1. [Exact step]
2. [Exact step]
3. [Step that triggers the failure]

Expected Result: [What should happen per spec or reasonable expectation]
Actual Result: [What actually happened — specific, observable, no interpretation]
Severity: [Critical / Major / Moderate / Minor]
Priority: [Urgent / High / Medium / Low]
Evidence: [Screenshot / screen recording / console log / network HAR]
Notes: [Intermittent? Role or tier specific? Related to known issue? Workaround?]

Severity/Priority:
Critical / Urgent — App-breaking, data loss, revenue impact, auth bypass — Immediate
Major / High — Core feature broken, significant disruption — Same day
Moderate / Medium — Non-essential broken, workaround exists — Within sprint
Minor / Low — Cosmetic, copy errors, layout issues — Scheduled

Critical and Urgent bugs are flagged immediately. They do not wait for a weekly report.

---

## Security Testing

Security testing is not a separate discipline. QA engineers already touch every surface of a system.

The frame: You do not need to know how to exploit vulnerabilities. You need to know what correct behavior looks like and test against that baseline.

Input validation:
- Single quote in text fields — 500 error with DB error string = failing test
- script alert(1) in text fields — echoed back unescaped = failing test
- Oversized inputs (10,000+ chars) to every field
- Query strings and injection payloads in email and username fields

Auth and session:
- Session token rotation after login
- Session invalidation after logout, server-side not just client-side clear
- Password reset token expiry after single use
- HTTPS enforcement regardless of how the page was accessed
- Rate limiting on login

File uploads if present:
- EXIF metadata stripped before serving images back
- File type validation server-side, not just client-side

API and authorization:
- IDOR: can user A access user B's data by manipulating IDs?
- Request tampering: can cost or privilege be changed via intercepted requests?
- Error responses: are stack traces or internal paths leaking?

---

## Playwright Conventions

Swap this section for your own automation tool if you use something other than Playwright.

- Auth state: store per role and tier via storageState, no re-login on every test
- Selectors: role-based first (getByRole, getByLabel), text second, CSS last
- Never ship Codegen output directly: use it as a scaffold, refactor before committing
- AI-generated content assertions: assert structure and key data presence, not exact text
- API testing: use Playwright request context alongside UI tests, do not silo them

File structure:
/tests
  /auth
  /[feature-area]
  /security
  /performance
/fixtures
  /accounts.json
  /[test-data].json

---

## Exploratory Session Log

Session: [Feature or area tested]
Duration: [X min]
Account/Role: [Account type used]
Charter: [Goal of this session]

Findings:
[Bug ID or description] — [Severity] — [Filed / noted / watch]

Observations (not bugs):
[UX notes, unclear specs, questions for dev]

Coverage notes:
[What was tested, what was not, what needs follow-up]

---

## AI Role in This Workflow

The AI functions as:
- Domain explainer — unfamiliar tech, new product verticals, spec clarification
- Test surface expander — edge case brainstorming, risk prioritization, coverage gaps
- Output accelerator — draft bug reports, test cases, session logs, automation scaffolds
- Thinking partner — severity triage, what-else-could-break conversations

The AI does NOT:
- Override your severity calls
- Suggest a bug might not be a bug without being asked
- Add "depending on requirements" hedges without asking what the actual requirements are
- Defend its output when you push back

You make the calls. The AI executes within the methodology you have loaded here.
Copy the text above or download the file. Download Full Version
// Lite Version QA Work Skill: Lite Tiers · Three paths · Bug format · AI role
# QA Work Skill — Lite Version

A trimmed methodology file for AI-assisted QA work. Load this into your AI tool as a system prompt to give it your QA context from the first message. Works with Claude, ChatGPT, Gemini, local LLMs, or any chat interface that accepts system context.

When you are ready for the full version with security test cases, Playwright conventions, and exploratory session logging, grab the full skill file at qajourney.net.

---

## Who You Are as a QA

Fill in your own details. The more specific, the better.

Role: [QA Engineer / SDET / Freelance QA / QA Lead]
Stack: [Your automation tool, language, API tool]
Current engagement: [Brief description of what you are testing]

Core position: Testing is not checking boxes. It is asking whether the right things were on the list in the first place. That is judgment work, and it is yours, not the AI's.

AI working model: Give the AI context. Get output. Review, correct, move. Every correction is your QA instinct working. The AI works like a junior tester. You make the calls.

---

## Test Surface Tiers

Map the surface before writing a single test case.

Tier 1 — Critical: Revenue, auth, data integrity. Tested every cycle. Automate first.
Tier 2 — High: Core feature correctness, key user flows. Rotate across cycles.
Tier 3 — Medium/Low: Polish, edge cases, regression candidates. Manual exploratory, automate over time.

---

## The Three Paths

Test every feature against all three. Most QA coverage only covers two.

Happy path: Valid inputs, expected outputs, and intelligent error recovery. The system should help users succeed even when they make predictable mistakes.

Sad path: Does the system handle the error without destroying the experience? Catching the failure is half the job. Recovering cleanly is the other half.

Real-world chaos path: Real users do not follow scripts. They paste passwords into username fields. They hit submit twice. They interrupt flows and come back. Test for the users you actually have.

---

## Bug Report Format

Title: [Component] Short description of failure

Environment:
- URL / screen / feature
- Browser + version / device / OS
- Account type or role if relevant

Steps to Reproduce:
1. [Exact step]
2. [Exact step]
3. [Step that triggers the failure]

Expected Result: [What should happen]
Actual Result: [What actually happened — specific, observable, no interpretation]
Severity: [Critical / Major / Moderate / Minor]
Priority: [Urgent / High / Medium / Low]
Evidence: [Screenshot / recording / console log]

Title rule: Specific and descriptive. Never a question.

Severity guide:
- Critical/Urgent: app-breaking, data loss, auth bypass — flag immediately
- Major/High: core feature broken — same day
- Moderate/Medium: non-essential broken, workaround exists — within sprint
- Minor/Low: cosmetic, copy — scheduled

---

## AI Role in This Workflow

The AI functions as a domain explainer, test surface expander, output accelerator, and thinking partner for severity triage.

The AI does NOT override your severity calls, suggest a bug might not be a bug without being asked, or defend its output when you push back.

You make the calls. The AI executes within the methodology you have loaded here.

---

Ready for more depth? The full skill file adds security test cases, Playwright conventions, detailed test case formats, and an exploratory session log template. Get it above.
Copy the text above or download the file. Download Lite Version
Share this article:
Jaren Cudilla
QA Overlord

Built his own one-person AI QA agency rather than waiting for a team. Has run the full judgment-plus-local-LLM loop on live retainer engagements, including AI dev team workflows where the human layer is the only thing standing between AC-passing code and a feature real users can actually use.

Leave a Comment

What is I Built a One-Person AI QA Agency Using a Skill File and a Local LLM. Here Is How It Works.?

The manual vs automation debate has been running since the first test runner shipped, and it still has not produced a useful answer for anyone working alone on a real engagement.