Live Vibe Coding Battle: Build a Java App with GitHub Copilot

Two predefined prompts, one shared Java app, and strict CI gates: the one-shot version started but failed when retrieving results, while the split-prompt workflow produced the more stable app and better quality metrics.

This stream was a simple but useful experiment: Catherine used the predefined master prompt in one shot, I used the predefined split prompt in six smaller steps, and we compared the outcomes against the same CI gates at the end. The app itself was a small healthcare assistant for symptom analysis, AI-generated advice, and specialist lookup.

Co-Speaker

Catherine Edelveis

DevRel at BellSoft

Catherine joined this live coding battle to compare prompting strategies under strict CI quality gates in a real Java app workflow.


Setup of the Battle

The repo already came with a strict evaluation pipeline, so this was never about generating code quickly. It was about generating code that can hold up under pressure.

Quality gates used in the stream:

Apache Maven PMD Plugin to catch code style problems, duplication, and common maintainability issues.
SpotBugs Maven Plugin to detect likely defects such as null handling issues, bad API usage, and risky implementation patterns.
JaCoCo Maven Plugin to measure test coverage and show how much of the generated code was actually exercised.
Trivy to scan the project for known vulnerabilities in dependencies and build artifacts.
OWASP ZAP Baseline Scan to run a lightweight dynamic security check against the running application.

That setup made the session useful because Copilot output was judged by objective checks, not by first impressions.

Results from the Live Stream

GitHub Actions results for the one-shot prompt and the split prompt.

Tool / Metric	One-Shot	Split Prompt
JaCoCo Coverage	11%	32%
PMD Violations	1	0
SpotBugs Total Bugs	26	17
Trivy Vulnerabilities	3	3
ZAP Medium	2	2
ZAP Low	6	6
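
To make the table easier to read at a glance, here is a minimal Java sketch that computes, per metric, how far the split prompt is ahead. The class `GateComparison`, the record `Metric`, and the method names are hypothetical and not part of the repo; only the numbers come from the table above.

```java
import java.util.List;

// Hypothetical helper for illustration; the values are copied from the results table.
class GateComparison {

    // One table row. higherIsBetter is true only for coverage; for all
    // finding counts, a lower number is the better result.
    record Metric(String name, int oneShot, int split, boolean higherIsBetter) {
        // Positive: split prompt is better. Zero: tie. Negative: one-shot is better.
        int splitAdvantage() {
            return higherIsBetter ? split - oneShot : oneShot - split;
        }
    }

    static final List<Metric> RESULTS = List.of(
        new Metric("JaCoCo Coverage (%)", 11, 32, true),
        new Metric("PMD Violations", 1, 0, false),
        new Metric("SpotBugs Total Bugs", 26, 17, false),
        new Metric("Trivy Vulnerabilities", 3, 3, false),
        new Metric("ZAP Medium", 2, 2, false),
        new Metric("ZAP Low", 6, 6, false));

    // Number of metrics where the split prompt scored better.
    static long splitWins() {
        return RESULTS.stream().filter(m -> m.splitAdvantage() > 0).count();
    }

    // Number of metrics where both approaches scored the same.
    static long ties() {
        return RESULTS.stream().filter(m -> m.splitAdvantage() == 0).count();
    }

    public static void main(String[] args) {
        for (Metric m : RESULTS) {
            System.out.printf("%-22s one-shot=%d split=%d advantage=%d%n",
                m.name(), m.oneShot(), m.split(), m.splitAdvantage());
        }
        System.out.println("split better: " + splitWins() + ", tied: " + ties());
    }
}
```

On the stream's numbers, the split prompt is ahead on three metrics (coverage, PMD, SpotBugs) and tied on the other three (Trivy, ZAP Medium, ZAP Low).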

The main result was clear:

the one-shot master prompt produced an application that started, but retrieving results failed with internal errors
the split-prompt approach produced a running application with a working result flow
coverage, PMD, and SpotBugs results favored the split prompts (32% vs. 11% coverage, 0 vs. 1 PMD violations, 17 vs. 26 SpotBugs bugs), while the Trivy and ZAP findings were identical

Important context: this was not a failure caused by Catherine. The key issue was prompt complexity. The master prompt tried to do too much in one shot, which made debugging and correction harder.

Quick Comparison: Master Prompt vs Split Prompt

Here is the practical difference between the two approaches we used.

Dimension	Master prompt (one-shot)	Split prompts (6 steps)
Scope per generation	Entire app in one go (DB, OpenAI services, REST API, Vaadin UI, tests, JMeter, Docker)	One layer at a time (domain, AI services, API, UI, tests, performance/docker)
Feedback loop	Late and wide: failures appear after many coupled changes	Early and local: each step can be built and validated before continuing
Debugging effort	High, because root causes are mixed across layers	Lower, because regressions are isolated to the current step
Architectural control	Weaker: more room for accidental cross-layer coupling	Stronger: clearer boundaries and staged integration
Session outcome	App started, but key result retrieval failed with internal errors	Reached a running app with result flow working and better quality-gate metrics

The short version: the master prompt optimized for one-pass completeness, while the split prompts optimized for incremental correctness.

UI Outcome (Unexpected)

One surprising part of the stream was UI quality: both generated applications looked rough, even though we used Vaadin and Vaadin usually has solid default styling. We could not fully explain during the session why both UIs looked that bad.

UI result from the live vibe coding battle

What Actually Worked Better

The split-prompt workflow changed two things that mattered most:

each step could be validated immediately
we could understand and review what the model changed before moving on

That gave better control over architecture and made it easier to correct Copilot when it produced confident but wrong output.

That is why the split approach was easier to steer and easier to trust.
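
The validate-before-moving-on idea can be sketched in a few lines of Java. This is only an illustration under assumptions: the class `StagedWorkflow` and its method are hypothetical, the step names paraphrase the six split-prompt layers, and each `BooleanSupplier` stands in for whatever check you run after a step (a build, the tests, the static analysis), which this sketch does not actually invoke.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: run the steps in order and stop at the first one
// whose validation fails, so a problem is pinned to a single step.
class StagedWorkflow {

    // Returns the name of the first failing step, or empty if all steps pass.
    // Steps after a failure are never evaluated.
    static Optional<String> firstFailingStep(Map<String, BooleanSupplier> steps) {
        for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
            if (!step.getValue().getAsBoolean()) {
                return Optional.of(step.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so steps run in the order listed.
        Map<String, BooleanSupplier> steps = new LinkedHashMap<>();
        // Placeholder validations; in practice each would run a build and the checks.
        steps.put("domain", () -> true);
        steps.put("ai-services", () -> true);
        steps.put("rest-api", () -> true);
        steps.put("ui", () -> true);
        steps.put("tests", () -> true);
        steps.put("performance-docker", () -> true);

        System.out.println(firstFailingStep(steps)
            .map(name -> "stop and fix step: " + name)
            .orElse("all steps validated"));
    }
}
```

The point of the design is the early return: with a one-shot prompt there is effectively one large step, so a failure only says that something somewhere is wrong.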

Why the Quality Gates Mattered

Without automated checks, this stream could have looked like both approaches were "kind of fine" for a while.

The CI gates made the differences visible:

code quality issues were surfaced quickly by PMD and SpotBugs
Trivy and the OWASP ZAP Baseline scan guarded against silent security regressions
we could compare prompting styles using measurable feedback

This is exactly where vibe coding gets real: if the code cannot pass your guardrails, it is not done.
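
As a rough illustration of "not done until the guardrails pass", the following Java sketch checks a report against thresholds. Everything here is an assumption: the class `QualityGate`, its records, and especially the threshold values are invented for the example and are not the ones configured in the stream's pipeline, so the pass/fail outcome says nothing about the real CI run. Only the metric values in `main` come from the results table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical gate check; names and thresholds are illustrative assumptions.
class QualityGate {

    // Metrics reported for one build.
    record Report(int coveragePercent, int pmdViolations, int spotBugsBugs) {}

    // Limits a build has to meet before it counts as done.
    record Thresholds(int minCoveragePercent, int maxPmdViolations, int maxSpotBugsBugs) {}

    // Returns the names of all gates the report fails; an empty list means done.
    static List<String> failures(Report report, Thresholds limits) {
        List<String> failed = new ArrayList<>();
        if (report.coveragePercent() < limits.minCoveragePercent()) {
            failed.add("coverage");
        }
        if (report.pmdViolations() > limits.maxPmdViolations()) {
            failed.add("pmd");
        }
        if (report.spotBugsBugs() > limits.maxSpotBugsBugs()) {
            failed.add("spotbugs");
        }
        return failed;
    }

    public static void main(String[] args) {
        // Assumed thresholds, chosen only to make the example concrete.
        Thresholds limits = new Thresholds(30, 0, 20);

        Report oneShot = new Report(11, 1, 26); // values from the results table
        Report split = new Report(32, 0, 17);   // values from the results table

        System.out.println("one-shot failed gates: " + failures(oneShot, limits));
        System.out.println("split failed gates: " + failures(split, limits));
    }
}
```

Returning the list of failed gates, rather than a single boolean, keeps the feedback specific enough to turn into the next prompt.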

Practical Takeaways

avoid oversized one-shot prompts for multi-layer Java apps
break work into smaller prompts with clear acceptance criteria
run static analysis and tests after each meaningful step
treat AI output as a draft that must be reviewed, not trusted
optimize for explainability during development, not only for speed

Helpful Links

XDEV IntelliJ PMD plugin

Final Thought

This session was a good reminder that better prompting is not about writing a longer prompt. It is about building a workflow where each step is small enough to validate.

The one-shot master prompt got the app running, but retrieving results still failed with internal errors. The split-prompt strategy delivered the more stable end-to-end flow and better quality signals. For production-minded Java teams, that is the stronger default.
