The Starting Point
RobinRelay started with a strong idea: what if your Slack workspace could help engineers understand alerts, incidents, ownership, and infrastructure context without forcing them to jump between tools?
When I joined as a founding engineer, the product was still early. It could send scheduled Slack updates about alert frequency and severity, which was useful, but it still felt like a notification bot.
It told you what happened.
It did not yet help you understand why it happened, who owned it, or what context mattered during triage.
That gap became the core challenge.
"A great operational tool is not just about collecting data; it is about making that data usable when production pressure is high."
Making Alert Data Easier to See
The first problem was visibility.
We had alert data, but engineers had to dig through Slack messages and historical logs to understand patterns. That is fine on a calm day. It is painful during an incident.
So I proposed and built a Slack Home Heatmap: a simple monthly view of alert activity inside Slack.
The idea was straightforward:
- Show noisy days.
- Highlight quiet days.
- Make alert spikes visually obvious.
- Keep everything inside the workflow engineers already use.
I built a lightweight FastAPI service that processed historical alert data and generated dynamic heatmaps. Those heatmaps were uploaded through Cloudinary and rendered directly inside Slack.
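As a rough illustration of the aggregation step behind such a heatmap, here is a minimal sketch that turns alert timestamps into a Monday-first month grid of per-day counts. The function names, the threshold values, and the sample timestamps are assumptions made for this example, not the service's actual code, and the image rendering step is left out.

```python
import calendar
from collections import Counter
from datetime import datetime


def monthly_alert_grid(alert_times, year, month):
    """Return weeks of 7 (day, count) cells; day is 0 for padding cells."""
    # Count alerts per calendar day, ignoring anything outside the requested month.
    per_day = Counter(
        t.day for t in alert_times if t.year == year and t.month == month
    )
    # monthcalendar() yields Monday-first weeks, using 0 for days outside the month.
    weeks = calendar.monthcalendar(year, month)
    return [[(day, per_day.get(day, 0) if day else 0) for day in week] for week in weeks]


def bucket(count, thresholds=(0, 3, 10)):
    """Map a raw count to a colour bucket index (0 = quiet, 3 = noisiest)."""
    return sum(count > t for t in thresholds)


# Made-up timestamps, only to show the shape of the output.
alerts = [
    datetime(2024, 3, 4, 9, 15),
    datetime(2024, 3, 4, 9, 40),
    datetime(2024, 3, 18, 2, 5),
]
grid = monthly_alert_grid(alerts, 2024, 3)
```

Each cell can then be coloured by its bucket, which is what makes noisy and quiet days stand out at a glance.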
It was not a massive feature, but it changed the experience.
Instead of asking, “Did alerts spike this week?”, teams could see the pattern instantly.
That was the first step in moving RobinRelay from a notification tool to an operational visibility layer inside Slack.
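For the Slack rendering side, a hosted image can be shown in an app's Home tab with a Block Kit image block. The sketch below only builds the view payload; the helper name, the header text, and the placeholder URL are assumptions for illustration.

```python
def build_home_view(image_url, month_label):
    """Build a Slack App Home view that shows a hosted heatmap image."""
    return {
        "type": "home",
        "blocks": [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"Alert activity: {month_label}"},
            },
            {
                "type": "image",
                "image_url": image_url,
                "alt_text": f"Alert heatmap for {month_label}",
            },
        ],
    }


# Placeholder URL; in practice this would point at the hosted heatmap asset.
view = build_home_view("https://example.com/heatmap.png", "March 2024")
```

A payload like this would be passed as the `view` argument to Slack's `views.publish` method, together with the `user_id` whose Home tab should be updated.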
Then Came the Incident Context Problem
Once the product moved beyond scheduled summaries, the next challenge was incident context.
Engineers did not just need to know that alerts happened. They needed answers to operational questions:
- Who owns this service?
- What happened last time?
- Which alerts are related?
- Was this a one-off spike or a recurring production issue?
On paper, a generic assistant-style workflow looked like the fastest path.
In practice, it was difficult to trust.
Sometimes the answers were good. Sometimes they were vague. Sometimes they mixed ownership, summaries, and infrastructure context in ways that were not reliable enough for SRE work.
For a normal chatbot, “mostly correct” may be acceptable.
For production operations, it is not.
If an engineer asks, “Who handled the production outage last Tuesday?”, the system cannot guess. It needs to retrieve the right context, follow the right path, and answer with confidence.
That is when I realized the issue was not just answer quality.
The issue was the backend workflow around operational context.
Owning the Workflow
At one point, the incident-context direction was close to being dropped. The quality was not where it needed to be, and the generic assistant approach was difficult to debug.
I spent the weekend digging into retrieval strategies, indexing patterns, query routing, and chunking approaches.
The shift was simple but important:
Stop treating incident context as one generic request path.
Start treating it like a proper backend workflow.
Instead of sending every query through the same path, I broke the flow into intent-based retrieval.
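    def handle_sre_query(user_input):
        # Classify the question first so each intent gets its own retrieval path.
        intent = classify_intent(user_input)

        if intent == "OWNERSHIP":
            # Ownership questions go to the service and incident owner lookup.
            return fetch_service_owners(user_input)
        if intent == "SUMMARY":
            # Summary questions are answered from timeline and alert history.
            return generate_incident_timeline(user_input)

        # Everything else falls back to broader contextual search.
        return run_contextual_lookup(user_input, intent)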
This gave us more control.
Ownership questions could use one retrieval strategy. Incident summaries could use another. General infrastructure questions could still go through broader contextual search.
The results improved quickly because the system was no longer trying to answer every operational question the same way.
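The post does not describe how `classify_intent` worked internally, so the following is only one possible shape for it: a small keyword-based router. The patterns and the "GENERAL" fallback label are hypothetical, chosen to match the kinds of questions mentioned in this post.

```python
import re

# Hypothetical keyword rules; the real classifier is not described in this post.
INTENT_PATTERNS = {
    "OWNERSHIP": re.compile(
        r"\b(who|owner|owns|owned|on[- ]call|resolved|handled)\b", re.I
    ),
    "SUMMARY": re.compile(
        r"\b(summari[sz]e|summary|timeline|recap|what happened)\b", re.I
    ),
}


def classify_intent(user_input: str) -> str:
    """Return the first matching intent label, or "GENERAL" when nothing matches."""
    # Dicts preserve insertion order, so OWNERSHIP is checked before SUMMARY.
    for intent, pattern in INTENT_PATTERNS.items():
        if pattern.search(user_input):
            return intent
    return "GENERAL"
```

The appeal of something this simple is that a misrouted question is easy to trace back to a specific rule, which matters more than cleverness when the answers feed incident triage.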
Reliability Guardrails
For SRE workflows, a wrong answer can be worse than no answer. I treated accuracy as a product reliability issue, not just a response-quality issue.
The improved flow separated common operational intents:
- ownership questions needed service and incident owner context,
- incident summaries needed timeline and alert-history context,
- infrastructure questions needed broader retrieval with tighter grounding.
This gave us a way to inspect failures. If ownership answers were weak, we could debug that retrieval path instead of blaming the whole system. If summaries were vague, we could tune timeline context separately.
That control is what helped push answer accuracy past 90%. The system became easier to reason about because each answer type had a clearer path from question to retrieved context to final Slack response.
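One way to make that per-path inspection concrete is to score reviewed answers separately for each intent. This is a sketch under the assumption that answers were reviewed and marked correct or incorrect; the function name and the sample data are invented for illustration and are not real results.

```python
from collections import defaultdict


def accuracy_by_intent(results):
    """results: iterable of (intent, was_correct) pairs from a reviewed evaluation set."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for intent, was_correct in results:
        totals[intent] += 1
        correct[intent] += int(was_correct)
    # Report the fraction of correct answers for each retrieval path on its own.
    return {intent: correct[intent] / totals[intent] for intent in totals}


# Toy, made-up review outcomes purely to show the shape of the report.
sample = [
    ("OWNERSHIP", True),
    ("OWNERSHIP", False),
    ("SUMMARY", True),
    ("SUMMARY", True),
]
report = accuracy_by_intent(sample)
```

A breakdown like this shows which retrieval path is dragging the overall number down, instead of a single blended score that hides it.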
Alert Noise Reduction
The work on Datadog alert noise was not about hiding alerts. It was about making repeated noise visible enough to act on.
The Slack heatmap made recurring noisy periods easy to spot. Once noisy services and days were visible, the team could group related alerts, tune alert surfaces, and focus on alerts that actually needed engineering attention.
That workflow helped reduce alert noise by 75% while keeping the important production signals in front of the team.
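The grouping step can be sketched as a simple frequency count over alert records. The record fields below (`service`, `monitor`) and the sample values are illustrative assumptions, not a Datadog schema and not real data.

```python
from collections import Counter


def noisiest_sources(alerts, top_n=3):
    """Count alerts per (service, monitor) pair and return the most frequent ones."""
    counts = Counter((a["service"], a["monitor"]) for a in alerts)
    return counts.most_common(top_n)


# Hypothetical alert records; field names are illustrative only.
alerts = [
    {"service": "checkout", "monitor": "high latency"},
    {"service": "checkout", "monitor": "high latency"},
    {"service": "checkout", "monitor": "high latency"},
    {"service": "search", "monitor": "5xx rate"},
]
top = noisiest_sources(alerts, top_n=1)
```

Ranking sources this way gives the team a short list of alerts to review first, rather than a long undifferentiated stream.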
From Bot Behavior to SRE Workflow
After improving the chunking, indexing, and retrieval flow, RobinRelay started feeling less like a scheduled script and more like an operational workflow inside Slack.
Engineers could ask things like:
- “Who resolved the uptick-api outage last Tuesday?”
- “Summarize alerts from the weekend.”
- “Why did database latency spike in production?”
- “Which service has been noisy this month?”
The system could connect alert frequency, ownership, timelines, and incident history.
That was the real product shift.
RobinRelay was no longer just sending alerts.
It was helping engineers make sense of alerts.
Technical Foundation
The stack was intentionally pragmatic. We focused on shipping useful SRE workflows first, then improving accuracy, visibility, and reliability.
- FastAPI powered the backend services connecting Slack, Datadog alert data, and incident-context workflows.
- n8n handled automation flows and integration steps across the system.
- Azure services supported contextual retrieval for incident memory and operational question handling.
- Cloudinary handled generated heatmap assets for Slack rendering.
- The Slack API served as the main product surface, keeping everything where engineering teams already worked.
The architecture stayed simple enough to move fast, but flexible enough to support deeper observability and triage workflows over time.
What I Learned
This project taught me that SRE automation is not just about connecting alerts to Slack.
The real value comes from the system around the alerts:
- how alert history is structured,
- how operational context is retrieved,
- how ownership is mapped,
- how noisy signals are grouped,
- and how the final response fits into the engineer’s workflow.
The biggest win was not adding a smarter interface.
The biggest win was turning raw alert noise into something engineers could actually use during real operational work.
Outcome
RobinRelay evolved from a scheduled Slack alert bot into a more capable SRE workflow for alert visibility, ownership lookup, and incident memory.
I contributed to that shift by building the Slack alert heatmap, improving infrastructure visibility, and redesigning the backend retrieval workflow to produce more reliable operational answers.
The final experience felt much closer to what the product wanted to become from day one:
an always-available SRE workflow inside Slack.

