Everyone’s talking about AI productivity gains. “10x developer.” “ChatGPT changed my life.” But when I asked engineer friends to show me their productivity data, guess how many had any? Zero. We’re all paying $100–200/month for AI tools based on vibes. I decided to fix that for myself. I’m finally starting to measure whether my elaborate multi-agent system is actually worth the investment—or if I’ve just built an expensive hobby.
So yeah, everybody is talking about AI productivity, vibe coding, and how fast you can code. But once you actually enter the vibe coding world and start reading people’s opinions, you notice something interesting—not everyone is thrilled. Productivity gains often seem subjective, and we might be paying for compute power that doesn’t really translate into real value.
This post is my way of dealing with that uncertainty. It’s an attempt to define a clear set of KPIs that, over time, will show whether I’m truly getting productivity gains—or just dreaming.
I’m not trying to answer the question of whether AI in general brings productivity gains. There is a great, really thorough Stanford study on productivity gains in enterprise environments, across different projects and languages, that I’m going to link in a separate post. But my problem is a little different. I’m an engineering manager with a day job, hobbies, friends, and a full life. Since Claude Code and vibe coding appeared, I’ve been able to squeeze a bit of application building into my busy days. Delivering an app used to be off my radar entirely because of the time and focus it would take; now I can actually do it, and I’ve already solved a couple of real problems around me with vibe coding.
So my issue is not whether AI gave me productivity; it did. But from where I stand now, I feel stuck in my own optimization fallacy. This week I added a lot of agents to my workflow and completely changed how Claude Code works: how the architect reviews the developer’s work, how the BA creates test cases and use cases, and how they all interact with each other. My personal impression is that I became much more productive.
But if you go online and start reading vibe coder opinions, everyone says the same thing—that their setup is the best, that they use MCP servers, and their productivity is through the roof. How do we know? How do I know that what I’m doing is actually bringing me value? Like with many things in life, we tend to deceive ourselves. Maybe it’s just my impression that I’m coding faster and better with fewer mistakes, while in reality I’m slower, spending more tokens per line of code, and the defect leakage is the same.
My attempt here is to answer those questions: What is my productivity baseline in my AI environment? How did my agentic setup impact that baseline? What metrics exist in the industry to measure agentic framework effectiveness? And how can I apply them to my projects?
So let’s be honest here: it’s not just about the cost, the $100 I’m paying monthly for Claude Code, but also about the amount of work and value I’m getting out of it. If every Friday I’m running out of credits and hitting the limits, I want to understand whether I’m getting the optimal value for what I’m paying.
I believe this can be optimized, and that I can get better value out of my money. Because if everyone gets the same number of credits, but some people use them more effectively, that’s what will create the real market difference: how effective you are at using new tools and your environment.
These are my main questions right now, and I want to understand this better.
So the honest truth is that right now I don’t have a baseline yet. I’m just starting to monitor it. Beginning this week, I’m changing how I manage tasks and will start observing the epics, stories, and bugs that I deliver each week. For every effort, I’ll assign three values: business complexity, architecture complexity, and lines of code generated. These will be my three main vectors for comparing results going forward.
Each of those items will also include a set of KPIs I’ll track—runtime of every agent, number of tokens used, number of lines of code generated, and different indicators of code quality such as complexity and churn, basically everything SonarQube provides.
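To keep myself honest about what actually gets recorded, here is a minimal sketch of the per-item record I have in mind, in Python. The field names and the 1-to-5 complexity scale are placeholders I made up for illustration, not a final schema:

```python
# Minimal sketch of the per-delivery record. Field names and the
# 1-5 complexity scale are placeholders, not a final schema.
from dataclasses import dataclass, field


@dataclass
class DeliveryRecord:
    item_id: str                  # epic / story / bug identifier
    item_type: str                # "epic", "story", or "bug"
    config_version: str           # which agentic setup produced it
    # The three comparison vectors assigned up front
    business_complexity: int      # e.g. 1-5
    architecture_complexity: int  # e.g. 1-5
    lines_generated: int
    # Per-item KPIs collected while the agents run
    agent_runtime_s: float        # total wall-clock time across agents
    tokens_used: int
    quality: dict = field(default_factory=dict)  # SonarQube-style indicators: complexity, churn, ...
```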
Every agentic setup will be versioned, so if I make a change, I’ll know which configuration a given piece of work was delivered under. Then I’ll observe ratios across those measurements: tokens per line of delivered code, defect leakage per unit of complexity, and so on.
The metric that matters most to me is output, and not just code generated but code actually used. In the end, I’m going to challenge myself every week to finish with something tangible delivered, measure its quality, and measure its defect leakage. I’ll compare results week to week to see if I’m improving, using the three values I mentioned earlier (business complexity, architecture complexity, and lines of code), and I’ll also track business outcomes expressed in traffic, bugs, leads, or anything connected with that.
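To make “ratios” concrete, here is a rough sketch of the kind of derived numbers I plan to compare per configuration version. The exact formulas, and the way defects get attributed to a configuration, are assumptions that will probably change once real data comes in:

```python
# Illustrative per-configuration ratios. `records` are DeliveryRecord
# objects from the sketch above; `defects_by_config` maps a
# config_version to the defects that leaked from work done under it.
from collections import defaultdict


def ratios_by_config(records, defects_by_config):
    """Aggregate delivery records into per-configuration ratios."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r.config_version].append(r)

    summary = {}
    for version, items in grouped.items():
        total_loc = sum(r.lines_generated for r in items)
        total_tokens = sum(r.tokens_used for r in items)
        total_complexity = sum(
            r.business_complexity + r.architecture_complexity for r in items
        )
        summary[version] = {
            "tokens_per_loc": total_tokens / max(total_loc, 1),
            "loc_per_complexity_point": total_loc / max(total_complexity, 1),
            "defect_leakage": defects_by_config.get(version, 0) / max(len(items), 1),
        }
    return summary
```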
I’m implementing a custom MCP server connected to my agentic framework. Every agent’s work will be reported to that server, with KPIs recorded for every project and initiative I take on with AI going forward. Hopefully, in 30 days I’ll have some results to share and some insights that let me really track the progress and effectiveness of my own work.
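As a sketch of what that reporting could look like, here is a minimal KPI-recording tool built on the FastMCP helper from the official Python MCP SDK. The tool name, the fields, and the SQLite storage are placeholders rather than my final design:

```python
# Sketch of a KPI-reporting MCP server using the FastMCP helper from
# the official Python MCP SDK. Schema and tool name are placeholders.
import sqlite3

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("kpi-tracker")

# check_same_thread=False because the tool may be invoked from a worker
# thread; a real setup would add proper locking or a connection per call.
db = sqlite3.connect("kpis.db", check_same_thread=False)
db.execute(
    """CREATE TABLE IF NOT EXISTS agent_runs (
           project TEXT, initiative TEXT, agent TEXT,
           config_version TEXT, runtime_s REAL,
           tokens INTEGER, lines_generated INTEGER
       )"""
)


@mcp.tool()
def report_agent_run(
    project: str,
    initiative: str,
    agent: str,
    config_version: str,
    runtime_s: float,
    tokens: int,
    lines_generated: int,
) -> str:
    """Record one agent run so every project and initiative lands in the same KPI store."""
    db.execute(
        "INSERT INTO agent_runs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (project, initiative, agent, config_version, runtime_s, tokens, lines_generated),
    )
    db.commit()
    return "recorded"


if __name__ == "__main__":
    mcp.run()
```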
I’m also logging the data visually so I can compare agents, configurations, and frameworks at a glance. These are the dashboards I check every Friday before deciding what to tweak for the next sprint:
Note: These visuals still show mocked data so I can validate the dashboards. The real numbers are now being collected and will replace the placeholders soon.
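For the visual side, something as simple as this already does for a first pass. It is just a sketch that turns the per-configuration summary from the earlier snippet into bar charts, nothing like a real dashboard yet:

```python
# Rough sketch of the Friday comparison view: one bar chart per ratio,
# grouped by configuration version. `summary` is the dict returned by
# ratios_by_config() above.
import matplotlib.pyplot as plt
import pandas as pd


def plot_config_comparison(summary: dict) -> None:
    df = pd.DataFrame(summary).T  # rows = config versions, columns = ratios
    df.plot(kind="bar", subplots=True, layout=(1, len(df.columns)),
            figsize=(4 * len(df.columns), 3), legend=False)
    plt.tight_layout()
    plt.show()
```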