on-call_guy's comments

Interesting. How is it different from the Slack integration PagerDuty already offers? We can get notifications from PagerDuty in Slack, right?


Great questions! So PagerDuty does provide a Slack integration. However, it posts incidents one by one as they are triggered, so your channel gets a lot of messages. What we have built here for daily notifications is a single summary post that gives you a clear idea of how many incidents you had in a day, their severity, status, assignee, etc. So you, your team, or your engineering manager can easily get a daily sneak peek into on-call health and stay on top of things without going through a laundry list of notifications.

In addition, on a weekly basis the app provides a report with a .CSV file that has all the important details of the PagerDuty incidents triggered during your on-call rotation. You can use it to run stats or metrics. We also provide a markdown file that you can upload into your document management system and just run your handoff call seamlessly, without gathering all the information manually. Please note that PagerDuty charges $50+ per user/month to provide such a detailed report; their base package only includes a basic report without the key info. Hope this helps. Let me know if you have any further questions.
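
For those curious about the mechanics, here's a rough sketch of the general idea (not our actual implementation; the API token, date window, webhook URL, and columns below are all placeholders): pull incidents from the PagerDuty REST API, post one Slack summary via an incoming webhook, and flatten the details into a CSV.

    import csv
    import requests  # pip install requests

    PD_TOKEN = "YOUR_PAGERDUTY_API_TOKEN"  # placeholder

    # Fetch the rotation's incidents from the PagerDuty REST API.
    resp = requests.get(
        "https://api.pagerduty.com/incidents",
        headers={"Authorization": f"Token token={PD_TOKEN}"},
        params={"since": "2023-01-02T00:00:00Z",
                "until": "2023-01-09T00:00:00Z",
                "limit": 100},
    )
    incidents = resp.json()["incidents"]

    # One Slack summary post instead of one message per incident,
    # sent through a standard Slack incoming webhook.
    summary = (f"On-call summary: {len(incidents)} incidents, "
               f"{sum(i['urgency'] == 'high' for i in incidents)} high urgency, "
               f"{sum(i['status'] == 'resolved' for i in incidents)} resolved")
    requests.post("https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
                  json={"text": summary})

    # Flatten the key fields into a handoff CSV.
    with open("oncall_report.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["created_at", "title", "urgency", "status", "assignee"])
        for inc in incidents:
            assignees = ", ".join(a["assignee"]["summary"]
                                  for a in inc.get("assignments", []))
            writer.writerow([inc["created_at"], inc["title"],
                             inc["urgency"], inc["status"], assignees])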


Ok, thanks. Is there a limit on how many alerts we can get in this Slack report?


Nope, there is no limit on the number of alerts. In fact, our summary notifications help managers and teams view them more intuitively.


Ok great, will take a look.


Yep, we do this on our team too (daytime only, though) and call it the "on-call buddy" for the first couple of rotations.


I see... how does it really work, though? Like, if I am on-call this week, would it show all my alerts in one place and then allow me to take some kind of action? How does it solve the other issues, like stale runbooks?


lol... yeah, we have that established within the team now. But it is still a challenge to find the right contact across teams at 3am if the issue is from another team. Maybe we should build a service-level ownership list so that we can tag the owners (in addition to their on-call). Curious to know what level of ownership you were referring to?


Basically the same as yours: who to call when shtf. The tricky part is that managers don't do trench work, so once the responsible developer doesn't reply, the on-call has to figure it out.


Got it. I wish there was an up-to-date service-level owner identified, which would reduce this to just a lookup and tagging. I have seen a few engineering teams at other companies start doing that.
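
To illustrate how cheap that lookup could be once the ownership data exists, here's a hypothetical sketch (the service names and Slack handles are made up):

    # Hypothetical service -> owner registry; in practice this would live
    # in a shared service catalog or config repo, not be hardcoded.
    SERVICE_OWNERS = {
        "payments-api": {"owner": "@alice", "oncall": "@payments-oncall"},
        "search-index": {"owner": "@bob", "oncall": "@search-oncall"},
    }

    def who_to_page(service: str) -> str:
        """Return the Slack handles to tag for a given service."""
        entry = SERVICE_OWNERS.get(service)
        if entry is None:
            return "owner unknown - escalate in the incident channel"
        return f"{entry['oncall']} (owner: {entry['owner']})"

    print(who_to_page("payments-api"))  # @payments-oncall (owner: @alice)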


I agree they can be improved. But they are not a one-time activity; it has to be a continuous process to really be efficient. Also, every on-call person needs to be diligent about it. Otherwise, the debt accumulates over time.

Again, I am talking about teams that are heavily loaded with alerts and incidents during on-call. In general, all these pain points vary a lot depending on the on-call load. We also have teams who manage all of this easily, as their on-call load is quite manageable. But solving for it in one place, so that everyone on the team is on the same page, would be amazing.


Ok... would love to know more about what is out there. The problem with current alerting solutions like PagerDuty is that they are very extensive in terms of what they offer (scheduling, reliability, etc.) but not quite tailored towards the needs of the on-call engineer to easily tag something, or of management to get a view of which alerts/incidents need attention.

Same with xMatters, which my wife's team uses but never logs into, as it crashes frequently; they prefer to debug through logs :P


Do you have an email I could reach you at? I'd like to share what I think the solution looks like and hear more about the challenges you've faced.

I totally agree that the tools that exist today cater to the buyer. That is, the people with purchasing power, who typically aren't on-call. I'm building with a focus on the on-call experience for the people who are actually on call.


I just stood up my landing page if you want to keep up to date: https://simpleoncall.com/.


Thanks for the link. It doesn't have many details, but I have signed up. Looking forward to getting more details about the solution.


Thank you. It's a great point, and I totally agree that good management plays a big role in making life a little easier. We did raise it to our management, but one of the limitations on their end is, again, too many different tools and too much scattered information, which do not give them full insight. For example, it's very hard to know which runbooks are stale and need updates unless you review them frequently. Curious to know how such problems were solved?


Why don't you update the run books? Why don't you modify the alerts/logs to give you more information? Why don't you create the missing run books when you run into undocumented issues?


It sounds like none of those things have an owner who is tasked with keeping them up to date and correct. Every piece of work that needs to be done must have a specific, well-documented owner; otherwise, diffusion of responsibility ensures that it will eventually fall through the cracks.


Management's job is to be "the owner". They are ultimately responsible for making sure that there is no diffusion of responsibility.

In our weekly meetings, recurring problems were identified and fixes implemented. No call was considered completed and closed until all relevant documents had been updated as appropriate. At the yearly review, the quality of your documentation was as important as your time to respond and time to fix. That is how mission-critical on-call work should be handled.


That's great. I think one of the issues in our process is that we use a wiki for on-call summary/hand-off notes. That's not ALWAYS very helpful, as it depends on what engineers add to it. The time and severity of the alerts make a difference as well. E.g., if they are triggered at night or at an unfriendly hour, the engineer's first instinct is to fix the issue, not to make a note or document it, unless there is an easy way to do so. We use PagerDuty, and I don't think it provides an easy way to make those notes or comments, so that leaves it to the engineers to do after the fact. Some teammates do it rigorously and some don't. I think management's challenge is also that they can only push so much before it becomes an attrition risk :(


Yea, that's not uncommon. Personally I prefer to give each document a specific owner, but whichever way you do it, someone has to be tasked with ensuring that the documentation is correct.


It sounds like your team lacks a culture of continuous improvement - IMO, in a product team, the on-call's full-time job is to make the next on-call engineer's job easier by deleting irrelevant alerts, automating fixes, and generally making the system more stable.

I wrote a longer guide about this here: https://onlineornot.com/incident-management/on-call/improvin...


Yeah, I must agree it is a cultural issue to some extent. But honestly, the on-call at my current company is quite demanding. So during the on-call week, though engineers try to improve things, they always run out of time or miss a few things, which then puts a burden on future on-calls.

I think there should be a nice lightweight tool that gives a clear summary and tracking mechanism to make this a quicker task. Even just something to tag the runbooks that haven't been updated. All those notes get lost in documentation and are never referred back to.
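
Even the "tag stale runbooks" part could be covered by something tiny. A sketch, assuming the runbooks live as markdown files in a git repo (the 90-day threshold and the runbooks/ path are arbitrary):

    import subprocess
    from datetime import datetime, timedelta, timezone
    from pathlib import Path

    STALE_AFTER = timedelta(days=90)  # arbitrary threshold

    for path in Path("runbooks").glob("*.md"):
        # Unix timestamp of the last commit that touched this file.
        out = subprocess.run(
            ["git", "log", "-1", "--format=%ct", "--", str(path)],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if not out:
            continue  # untracked file, skip
        last_touched = datetime.fromtimestamp(int(out), tz=timezone.utc)
        if datetime.now(timezone.utc) - last_touched > STALE_AFTER:
            print(f"STALE: {path} (last updated {last_touched:%Y-%m-%d})")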


In previous teams, we just used a JIRA backlog to manage these tasks.


Yeah, JIRA could be handy and useful, though you need to create tickets for every task and monitor them rigorously alongside the other backlog and story items.
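
If creating those tickets by hand is the pain, the Jira REST API makes it scriptable. A hypothetical sketch; the base URL, credentials, and ONCALL project key are all placeholders:

    import requests  # pip install requests

    JIRA_BASE = "https://yourcompany.atlassian.net"  # placeholder
    AUTH = ("you@example.com", "YOUR_API_TOKEN")     # Jira Cloud basic auth

    def file_followup(summary: str, description: str) -> str:
        """Create a follow-up task in the on-call backlog; return its key."""
        resp = requests.post(
            f"{JIRA_BASE}/rest/api/2/issue",
            auth=AUTH,
            json={"fields": {
                "project": {"key": "ONCALL"},  # placeholder project
                "issuetype": {"name": "Task"},
                "summary": summary,
                "description": description,
            }},
        )
        resp.raise_for_status()
        return resp.json()["key"]

    print(file_followup("Update payments runbook",
                        "Runbook was missing the failover steps from the 3am page."))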


Looks great

