I love the idea, but this line: *> 1) no bug should take over 2 days* Is odd. It...

muixoozie · 2025-11-24T13:43:57 1763991837

I worked for a company that used MS SQL Server a lot, and we would run into a heisenbug every few months that would crash our self-hosted MS SQL Server cluster or make it unresponsive. I'm not a database person, so I'm probably butchering the description here. From our POV, progress would stop and require manual intervention (on call). Back and forth went on between MS and our DBAs for YEARS, poring over logs or whatever they do. Honestly, I never thought it would be fixed. Then one time it happened and we caught all the data going into the commit and realized it would reproduce the crash 100% of the time: only if we restored the database to a specific state and applied this specific commit would it crash MS SQL Server. NDAs were signed and I took a machete to our code base to create a minimal repro binary that could deserialize our data store and commit / crash MS SQL Server. Made a nice PowerShell script to wrap it and repro the issue fast, and guess what? Within a month they fixed it. It was never clear what exactly the problem was on their end. I got buffer overflow vibes, but that's a guess.

DanielHB · 2025-11-24T18:19:27 1764008367

I once ran into a bug where our server code would crash only on a specific version of the Linux Kernel under a specific version of the OpenJDK that our client had. At least it would crash at startup but it was some good 2 weeks of troubleshooting because we couldn't change the target environment we were deploying on.

At least it crashed at startup, if it was random it would have been hell.

newtwilly · 2025-11-24T16:04:03 1764000243

Wow, that's pretty epic and satisfying

kykat · 2025-11-24T03:45:57 1763955957

Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks. Particularly on games, where you can easily have event A that triggers B unless C is in X state...

What I want to say is that I've seen what happens in a team with a history of quick fixes and inadequate architecture design to support the complex features. In that case, a proper bugfix could create significant rework and QA.

arkh · 2025-11-24T08:12:38 1763971958

> Sometimes, a "bug" can be caused by nasty architecture with intertwined hacks

The joys of enterprise software: searching for the cause of a bug lets you discover multiple "forgotten" servers, ETL jobs, and crons, all interacting together. And no one knows why they do what they do the way they do it, because the people who knew left many years ago.

fransje26 · 2025-11-24T09:31:44 1763976704

> searching for the cause of a bug lets you discover multiple "forgotten" servers, ETL jobs, crons all interacting together. And no one knows why they do [..]

And then comes the "beginner's" mistake. They don't seem to be doing anything. Let's remove them, what could possibly go wrong?

HelloNurse · 2025-11-24T11:32:24 1763983944

If you follow the prescribed procedure and involve all required management, it stops being a beginner's mistake; and given reasonable rollback provisions it stops being a mistake at all, because if nobody knows what the thing is it cannot be very important, and a removal attempt is the most effective and cost efficient way to find out whether the thing can be removed.

Retric · 2025-11-24T13:57:07 1763992627

> a removal attempt is the most effective and cost efficient way to find out whether the thing can be removed

Cost efficient for your team’s budget sure, but a 1% chance of a 10+ million dollar issue is worth significant effort. That’s the thing with enterprise systems: the scale of minor blips can justify quite a bit. If 1 person operating for 3 months could figure out what something is doing, there are scales where that’s a perfectly reasonable thing to do.

Enterprise covers a whole range of situations; there are a lot more billion dollar orgs than trillion dollar orgs, so your mileage may vary.
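The trade-off in this comment can be sketched as an expected-value comparison. The 1% probability, the $10 million loss, and the "1 person for 3 months" come from the comment; the monthly cost per person is a made-up figure for illustration only.

```python
def expected_loss(probability, loss):
    """Expected cost of a risk: chance of it happening times its price tag."""
    return probability * loss


def investigation_cost(people, months, monthly_cost_per_person):
    """Cost of having people dig into what the unknown system does."""
    return people * months * monthly_cost_per_person


# Figures from the comment: 1% chance of a $10M issue.
risk = expected_loss(0.01, 10_000_000)

# ASSUMPTION: $20k per person-month is a hypothetical fully loaded cost.
cost = investigation_cost(people=1, months=3, monthly_cost_per_person=20_000)

worth_investigating = cost < risk
print(f"expected loss ${risk:,.0f}, investigation ${cost:,.0f}, "
      f"investigate first: {worth_investigating}")
```

With a smaller potential loss or a lower probability the comparison flips, which is the "your mileage may vary" point.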

HelloNurse · 2025-11-24T16:27:14 1764001634

If there is a risk of a 10+ million dollar issue there is also some manager whose job is to overreact when they hear the announcement that someone wants to eliminate thing X, because they know that thing X is a useful part of the systems they are responsible for.

In a reasonable organization only very minor systems can be undocumented enough to fall through the cracks.

Retric · 2025-11-24T17:40:10 1764006010

In an ideal world sure, but knowledge gets lost every time someone randomly quits, dies, retires etc.

Stuff that’s been working fine for years is easy for a team to forget about, especially when it’s a hidden dependency in some script that’s going to make some process quietly fail.

tremon · 2025-11-25T12:20:47 1764073247

The OP explicitly said "if you involve all required management", and that is key here. Having a process that is responsible for X million dollars of revenue yet is owned by no manager is a liability for the business (as is having an asset in operation that serves no purpose). Identifying that situation in a controlled manner is much better than letting it linger until it surfaces at a moment of Murphy's choosing.

> Stuff that’s been working fine for years is easy for a team to forget about

That's why serious companies have a documentation system describing their processes, tools and dependencies.

Retric · 2025-11-25T14:12:54 1764079974

The basic premise was it’s no longer obvious if a system is still doing anything useful. If the system had easy to locate documentation saying everything that used it then there wouldn’t be an issue, but that’s very difficult to maintain.

Documentation on every possible system that could use the resource would need to be accurate and complete, and someone would have to locate it, actually read it, remember it, and communicate it to someone in a relevant meeting, which may be taking place multiple levels of management above the reader. As part of that chain, a new manager shows up and faces endless seemingly minor details, so even if they did encounter that information at some point, there’s nothing that particularly calls it out as worth remembering at the time.

That’s a lot of individual points of failure which is why I’m saying in the real world even well run companies mess this stuff up.

chrisweekly · 2025-11-24T15:20:14 1763997614

Well, maybe. See Chesterton's Fence^1

[1] https://theknowledge.io/chestertons-fence-explained/

amalcon · 2025-11-24T16:06:38 1764000398

I have had several things over the course of my career that:

1) I was (temporarily) the only one still at the company who knew why it was there

2) I only knew myself because I had reverse engineered it, because the person who put it there had left the company

Now, some of those things had indeed become unnecessary over time (and thus were removed). Some of them, however, have been important (and thus were documented). In aggregate, it's been well worth the effort to do that reverse engineering to classify things properly.

notTooFarGone · 2025-11-24T10:46:44 1763981204

I've fixed more than enough bugs by just removing the code and doing it the right way.

Of course you can get lost on the way but worst case is you learn the architecture.

Mtinie · 2025-11-24T14:41:35 1763995295

If it’s done in a controlled manner with the ability to revert quickly, you’ve just instituted a “scream test[0].”

____

[0] https://open.substack.com/pub/lunduke/p/the-scream-test

(Obviously not the first description of the technique as you’ll read, but I like it as a clear example of how it works)

fragmede · 2025-11-24T10:38:31 1763980711

That's a management/cultural problem. If no one knows why it's there, the right answer is to remove it and see what breaks. If you're too afraid to do anything, for nebulous cultural reasons, you're paralyzed by fear and no one's operating with any efficiency. It hits different when it's the senior expert that everyone reveres that invented everything the company depends on that does it, vs a summer intern vs Elon Musk bought your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.

ljm · 2025-11-24T11:30:19 1763983819

This does depend on a certain level of testing (automated or otherwise) for you to even be able to identify what breaks in the first place. The effect might be indirect several times over and you don't see what has changed until it lands in front of a customer and they notice it right away.

Move fast and break things is also a managerial/cultural problem in certain contexts.

mschuster91 · 2025-11-24T12:53:06 1763988786

> It hits different when it's the senior expert that everyone reveres that invented everything the company depends on that does it, vs a summer intern vs Elon Musk bought your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.

You can only say that with a straight face if you're not the one responsible for cleaning up after Musk or whatever CTO sharted across the chess board.

C-levels love the "shut it down and wait until someone cries up" method because it gives easy results on some arbitrary KPI metric without exposing them to the actual fallout. In the worst case the loss is catastrophic, requiring weeks worth of ad-hoc emergency mode cleanup across multiple teams - say, some thing in finance depends on that server doing a report at the end of the year and the C-level exec's decision was made in January... but by that time, if you're in real bad luck, the physical hardware got sold off and the backup retention has expired. But when someone tries to blame the C-level exec, said C-level exec will defend themselves with "we gave X months of advance warning AND 10 months after the fact no one had complained".

faidit · 2025-11-24T13:26:50 1763990810

It can also be dangerous to be the person who blames execs. Other execs might see you as a snake who doesn't play the game, and start treating you as a problem child who needs to go, your actual contributions to the business be damned. Even if you have the clout to piss off powerful people, you can make an enemy for life there, who will be waiting for an opportunity to blame you for something, or use their influence to deny raises and resources to your team.

Also with enterprise software a simple bug can do massive damage to clients and endanger large contracts. That's often a good reason to follow the Chesterton's fence rule.

tremon · 2025-11-25T12:34:57 1764074097

> C-levels love the "shut it down and wait until someone cries up" method because it gives easy results on some arbitrary KPI metric without exposing them to the actual fallout

It's not in the C-level's job description to manage the daily operations of the company, they have business managers to do that. If there's an expensive asset in the company that's not (actively) owned by any business manager, that's a liability -- and it is in the C-level's job description to manage liabilities.

> said C-level exec will defend themselves with "we gave X months of advance warning AND 10 months after the fact no one had complained"

And that's a perfectly valid defense, they're acting true to their role. The failure lies with the business/operations manager not being in control of their process tooling.

xnorswap · 2025-11-24T14:07:29 1763993249

The next mistake is thinking that completely re-writing the system will clean out the cruft.

silvestrov · 2025-11-24T08:44:19 1763973859

plus report servers and others that run on obsolete versions of Windows/unix/IBM OS plus obsolete software versions.

and you just look at this and think: one day, all of this is going to crash and it will never, ever boot again.

lovich · 2025-11-25T02:34:32 1764038072

I still have nightmares of load bearing Perl scripts and comlink interops, and then of course our dear friend the GAC

groestl · 2025-11-24T11:44:21 1763984661

And then it turns out the bug is actually very intentional behavior.

ChrisMarshallNY · 2025-11-24T03:48:21 1763956101

In that case, maybe having bug fixing be a two-step process (identify, then fix), might be sensible.

OhMeadhbh · 2025-11-24T04:38:39 1763959119

I do this frequently. But sometimes identifying and/or fixing takes more than 2 days.

But you hit on a point that seems to come up a lot. When a user story takes longer than the allotted points, I encourage my junior engineers to split it into two bugs. Exactly like what you say... One bug (or issue or story) describing what you did to typify the problem and another with a suggestion for what to do to fix it.

There doesn't seem to be a lot of industry best practice about how to manage this, so we just do whatever seems best to communicate to other teams (and to ourselves later in time after we've forgotten about the bug) what happened and why.

Bug fix times are probably a pareto distribution. The overwhelming majority will be identifiable within a fixed time box, but not all. So in addition to saying "no bug should take more than 2 days" I would add "if the bug takes more than 2 days, you really need to tell someone, something's going on." And one of the things I work VERY HARD to create is a sense of psychological safety so devs know they're not going to lose their bonus if they randomly picked a bug that was much more wicked than anyone thought.
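To make the Pareto guess concrete, here is a minimal sketch using the standard Pareto CDF, P(X <= x) = 1 - (x_m / x)^alpha. The scale (a quarter of a day minimum) and shape (about 1.16, the classic "80/20" value) are assumptions picked for illustration, not measurements from any real bug tracker.

```python
def fraction_within_timebox(timebox_days, scale_days, shape):
    """Share of bugs resolved within the time box under a Pareto model.

    scale_days is the minimum possible fix time (x_m) and shape is the
    Pareto exponent (alpha). Both are hypothetical tuning knobs here.
    """
    if timebox_days <= scale_days:
        return 0.0
    return 1.0 - (scale_days / timebox_days) ** shape


# ASSUMED parameters: minimum fix time 0.25 days, shape 1.16.
within_two_days = fraction_within_timebox(2, scale_days=0.25, shape=1.16)
within_ten_days = fraction_within_timebox(10, scale_days=0.25, shape=1.16)
print(f"fixed within 2 days:  {within_two_days:.1%}")
print(f"fixed within 10 days: {within_ten_days:.1%}")
```

Under these assumed parameters most bugs fit in the two-day box, but the tail never quite reaches 100%, which is why the "tell someone when it blows past the box" rule matters more than the box itself.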

ljm · 2025-11-24T15:56:00 1763999760

I like to do this as a two-step triage because one aspect is the impact seen by the user and how many it reaches, but the other is how much effort it would take to fix and how risky that is.

Knowing all of those aspects and where an issue lands makes it possible to prioritise it properly, but it also gives the developer the opportunity to hone their investigation and debugging skills without the pressure to solve it at the same time. A good write-up is great for knowledge sharing.

ChrisMarshallNY · 2025-11-24T05:38:40 1763962720

You sound like a great team leader.

Wish there were more like you, out there.

marginalia_nu · 2025-11-24T09:52:08 1763977928

I think in general, bugs go unfixed in two scenarios:

1. The cause isn't immediately obvious. In this case, finding the problem is usually 90% of the work. Here it can't be known beforehand how long finding the problem will take, though I don't think bailing because it's taking too long is a good idea. If anything, it's in those really deep rabbit holes that the real gremlins can hide.

2. The cause is immediately obvious, but is an architecture mistake, the fix is a shit-ton of work, breaks workflows, requires involving stakeholders, etc. Even in this case it can be hard to say how long it will take, especially if other people are involved and have to sign off on decisions.

I suppose it can also happen in low-trust sweatshops where developers are held on such a tight leash they aren't able to fix trivial bugs they find without first going through a bunch of jira rigmarole, which is sort of low key the vibe I got from the post.

QuiEgo · 2025-11-24T14:20:20 1763994020

As someone who works with hardware, hard-to-repro bugs can take months to track down. Your code, the compiler, or the hardware itself (which is often a complex ball of IP from dozens of manufacturers held together with a NoC) could all be a problem. The extra fun bugs are when a bug is due to problems in two or three of them combining together in the perfect storm to make a mega bug that is impossible to reproduce in isolation.

QuiEgo · 2025-11-24T14:44:18 1763995458

Random example: I once worked on a debug where you were not allowed to send zero length packets due to a known HW bug. Okay fine, work around in SW. Turns out there was an HW eviction timer that was disabled. It was connected to a counter that counted sys clk ticks. Turns out it was not disabled entirely properly due to a SW bug, so once every 2^32 ticks, it would trigger an eviction, and if the queue happened to be empty, it would send a ZLP, which triggered the first bug (hard hang the system in a way that breaks the debugger). There were dozens of ways that could hard hang the system, this was just one. Good luck debugging that in two days.
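For a sense of how rarely "once every 2^32 ticks" fires, here is a small arithmetic sketch. The comment does not give the system clock frequency, so the frequencies below are purely hypothetical.

```python
def wrap_period_seconds(clock_hz, counter_bits=32):
    """Seconds between wraparounds of a free-running tick counter."""
    return (2 ** counter_bits) / clock_hz


# ASSUMPTION: these clock rates are illustrative; the real one is unknown.
for clock_hz in (1_000_000, 100_000_000):
    period = wrap_period_seconds(clock_hz)
    print(f"{clock_hz / 1e6:>6.0f} MHz -> wraps every {period:,.1f} s "
          f"({period / 60:.1f} min)")
```

Even when the wrap itself is frequent, the hang also needs the queue to be empty at that exact moment, so the observed failure rate would be much lower than the wrap rate.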

jeffreygoesto · 2025-11-24T15:21:56 1763997716

We had one where data, interpreted as address (simple C typo before static analysis was common) fell into an unmapped memory region and the PCI controller stalled trying to get a response, thereby also halting the internal debugging logic and JTAG just stopped forever (PPC603 core). Each time you'd hit the bug, the debugger was thrown off.

OhMeadhbh · 2025-11-24T04:31:35 1763958695

At Amazon we had a bug that was the result of a compiler bug and the behaviour of intel cores being mis-documented. It was intermittent and related to one core occasionally being allowed to access stale data in the cache. We debugged it with a logic analyzer, the commented nginx source and a copy of the C++ 11 spec.

It took longer than 2 days to fix.

ChrisMarshallNY · 2025-11-24T05:37:26 1763962646

I’m old enough to have used ICEs to trace program execution.

They were damn cool. I seriously doubt that something like that, exists outside of a TSMC or Intel lab, these days.

plq · 2025-11-24T05:59:52 1763963992

ICE meaning in-circuit emulator in this instance, I assume?

ChrisMarshallNY · 2025-11-24T10:01:53 1763978513

Yeah. Guess it’s kind of a loaded acronym, these days.

Windchaser · 2025-11-24T15:59:01 1763999941

/imagining using an internal combustion engine here

OhMeadhbh · 2025-11-25T05:55:44 1764050144

"Rejecting this pull request because the patch you submitted does not provide enough torque."

buildbot · 2025-11-24T22:26:27 1764023187

They float around on ebay! Software might be an issue.

amoss · 2025-11-24T07:10:18 1763968218

When you work on compilers, all bugs are compiler bugs.

(apart from the ones in the firmware, and the hardware glitches...)

auguzanellato · 2025-11-24T05:33:17 1763962397

What kind of LA did you use to debug an Intel core?

OhMeadhbh · 2025-11-24T06:41:10 1763966470

The hardware team had some semi-custom thing from intel that spat out (no surprise) gigabytes of trace data per second. I remember much of the pain was in constructing a lab where we could drive a test system at reasonable loads to get the buggy behavior to emerge. It was intermittent so it took us a couple weeks to come up with theories, another couple days for testing and a week of analysis before we came up with triggers that allowed us to capture the data that showed the bug. It was a bit of a production.

Aurornis · 2025-11-24T18:34:35 1764009275

All of the buggy software projects I've been employed to work on have had some version of this rule.

Usually it's implicit, rather than explicit: Nobody tells you to limit work on bugs to 1-2 days, but if you spend an entire week debugging something difficult and don't accumulate any story points in Jira, a cadre of project managers, program managers, and other manager titles you didn't even know existed will descend upon you and ask why you're dragging the velocity down.

Lesson learned: Next time, avoid the hard bugs and give up early if something isn't going to turn into story points for hidden charts that are viewed by more people than you ever thought.

kccqzy · 2025-11-24T19:21:20 1764012080

I hate this kind of management culture that misuses story points. Story points are supposed to take into account difficulty. So if you spend an entire week debugging a difficult bug, you should’ve accumulated about the same amount of story points as colleagues debugging ten easy bugs.

int_19h · 2025-11-25T04:13:52 1764044032

Just about everything about Agile as it is actually practiced IRL by most workplaces is "misuses X".

At some point one can't help but wonder: if almost everyone is "misusing" it, then maybe it's a problem with the methodology itself, and the people for whom it works would have worked just as well organically without it?

oldestofsports · 2025-11-24T20:20:07 1764015607

Everyone has a different approach to story points and everyone thinks their way is "the right way". In the end they just turn into an abstraction layer for man hours.

aeternum · 2025-11-24T19:25:57 1764012357

It's the right lesson because the difficulty of the bug often depends on the dev. For example it might take one dev weeks to figure out that a hang is due to a sleep(.001) call within asyncio whereas another can identify it with a glance at the code.
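The comment doesn't say exactly what the sleep(.001) bug was. One common reading, assumed here, is a blocking `time.sleep` called inside a coroutine, which stalls the whole event loop. In this minimal sketch the `heartbeat` and `worker` names are made up for illustration.

```python
import asyncio
import time


async def heartbeat(counter):
    """Background task that should keep ticking while other work runs."""
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(0)  # yield control back to the event loop


async def worker(blocking, iterations=20):
    for _ in range(iterations):
        if blocking:
            time.sleep(0.001)           # blocks the thread: loop is frozen
        else:
            await asyncio.sleep(0.001)  # suspends: other tasks get to run


async def run(blocking):
    """Return how many heartbeat ticks happened while the worker ran."""
    counter = {"ticks": 0}
    task = asyncio.create_task(heartbeat(counter))
    await worker(blocking)
    ticks = counter["ticks"]
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return ticks


print("ticks with time.sleep:   ", asyncio.run(run(blocking=True)))
print("ticks with asyncio.sleep:", asyncio.run(run(blocking=False)))
```

With the blocking call the heartbeat never gets scheduled while the worker runs. In a real service that looks like an unexplained hang, even though each individual sleep is only a millisecond.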

1718627440 · 2025-11-24T21:27:10 1764019630

Which is why they get paid different rates.

aeternum · 2025-11-26T05:43:29 1764135809

Except they often don't.

bottlero_cket · 2025-11-24T21:15:43 1764018943

Lesson learned, just avoid the hard bugs, I don’t think that is feasible for most of us!

lapcat · 2025-11-24T03:49:34 1763956174

> It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

This is explained later in the post. The 2 day hard limit is applied not to the estimate but rather to the actual work: "If something is ballooning, cut your losses. File a proper bug, move it to the backlog, pick something else."

ChrisMarshallNY · 2025-11-24T03:55:26 1763956526

Most of the work in finding/fixing bugs is reproducing them reliably enough to determine the root cause.

Once I find a bug, the fix is often negligible.

But I can get into a rabbithole, tracking down the root cause. I don’t know if I’ve ever spent more than a day, trying to pin down a bug, but I have walked away from rabbitholes, a couple of times. I hate doing that. Leaves an unscratchable itch.

PaulKeeble · 2025-11-24T03:56:32 1763956592

Sometimes you find the cause of the bug in 5 minutes because it's precisely where you thought it was; sometimes it's not there and you end up writing some extra logging to hopefully expose its cause in production after the next release, because you can't reproduce it as it's transient. I don't know how to predict how long a bug will take to reproduce and track down, and only once it's understood do we know how long it takes to fix.

khannn · 2025-11-24T11:19:52 1763983192

I had a job that required estimation on bug tickets. It's honestly amazing how they didn't realize that I'd take my actual estimate, then multiply it by 4, then use the extra time to work on my other bug tickets that the 4x multiplier wasn't good enough for.

mewpmewp2 · 2025-11-24T11:30:17 1763983817

That's just you hedging, they don't really need to know that. As long as you are hedging accurately in the big picture, that's all that matters. They need estimates to be able to make decisions on what should be done and what not.

You could tell them that 25% chance it's going to take 2 hours or less, 50% chance it's going to take 4 hours or less, 75% chance it's going to take 8 hours or less, 99% it's going to take 16 hours or less, to be accurate, but communication wise you'll win out if you just call items like those 10 hours or similar intuitively. Intuitively you feel that 10 hours seems safe with those probabilities (which are intuitive experience based too). So you probably would say 10 hours, unless something really unexpected (the 1%) happens.

Btw in reality with above probabilities the actual average would be 5h - 6h with 1% tasks potentially failing, but even your intuitive probability estimations could be off so you likely want to say 10h.
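The "5h - 6h" average can be checked with a rough calculation. This sketch assumes time is spread evenly within each bucket (so each bucket is represented by its midpoint) and ignores the 1% tail beyond 16 hours; both are simplifying assumptions rather than anything stated in the comment.

```python
def expected_hours(quantiles, floor=0.0):
    """Approximate mean from (cumulative_probability, upper_bound_hours) pairs.

    Each bucket between consecutive bounds contributes its probability mass
    times its midpoint. Mass beyond the last bound is left out.
    """
    total = 0.0
    prev_p, prev_h = 0.0, floor
    for p, h in quantiles:
        midpoint = (prev_h + h) / 2
        total += (p - prev_p) * midpoint
        prev_p, prev_h = p, h
    return total


# Quantiles quoted in the comment: 25% <= 2h, 50% <= 4h, 75% <= 8h, 99% <= 16h.
estimate = [(0.25, 2), (0.50, 4), (0.75, 8), (0.99, 16)]
mean_hours = expected_hours(estimate)
print(f"approximate mean, excluding the 1% tail: {mean_hours:.2f} h")
```

The gap between this average and the quoted 10 hours is the buffer: it covers the unbounded tail plus the chance that the intuitive probabilities themselves are off.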

But anyhow that's why story points are mostly used as well, because if you say hours they will naturally think it's more fixed estimation. Hours would be fine if everyone understood naturally that it implies a certain statistical average of time + reasonable buffer it would take over a large amount of similar tasks.

georgemcbay · 2025-11-24T16:14:28 1764000868

Are you sure they didn't realize it...?

Virtually everywhere I've ever worked has had an unwritten but widely understood informal policy of placing a multiple on predicted effort for both new code/features and bug fixing to account for Hofstadter's law.

khannn · 2025-11-26T11:03:32 1764155012

There are a lot of scrum masters who have never actually programmed

ZaoLahma · 2025-11-24T10:25:13 1763979913

> That said, unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one.

The longer I work as a software engineer, the rarer it is that I get to work with bugs that take only a day to fix.

ChrisMarshallNY · 2025-11-24T10:27:54 1763980074

I've found the opposite to be true, in my case.

ZaoLahma · 2025-11-24T11:55:20 1763985320

For me the longer I work, the worse the bugs I work with become.

Nowadays, after some 17 years in the business, it's pretty much always intermittently and rarely occurring race conditions of different flavors. They might result in different behaviors (crashes, missing or wrong data, ...), but at the core of it, it's almost always race conditions.

The easy and quick to fix bugs never end up with me.

lll-o-lll · 2025-11-24T12:07:39 1763986059

Yep. Non-determinism. Back in the day it was memory corruption caused by some race condition. By the time things have gone pop, you’re too far from the proximate cause to have useful logs or dumps.

“Happens only once every 100k runs? Won’t fix”. That works until it doesn’t, then they come looking for the poor bastard that never fixes a bug in 2 days.

ChrisMarshallNY · 2025-11-24T12:12:57 1763986377

My first job was as an RF (microwave) bench technician. My initial schooling was at a trade school for electronic technicians.

It was all about fixing bugs; often, terrifying ones.

That background came in handy, once I got into software.

lll-o-lll · 2025-11-24T12:26:20 1763987180

I started life as an engineer. Try reverse engineering why an electrical device your company designed (industrial setting, so big power), occasionally and I mean, really really rarely, just explodes; burying its cover housing half way through the opposite wall.

Won’t fix doesn’t get accepted so well. Trying to work out what the hell happened from the charred remains isn’t so easy either.

ChrisMarshallNY · 2025-11-24T12:37:27 1763987847

Sounds like some great stories.

int_19h · 2025-11-25T04:20:26 1764044426

The worst bug in my career was when the app would reliably crash if you left it running for "long enough" - but still non-deterministically, so sometimes it would happen in an hour, sometimes in three. The crash itself was quickly diagnosed as a corrupt vtable, but finding the piece of code that had a pointer bug in it that just happened to write into (some) object's vtable in certain situations that triggered a race condition took many days.

ChrisMarshallNY · 2025-11-24T12:06:05 1763985965

The reward for good work, is more work.

I tend to mostly work alone, these days (Chief Cook & Bottle-Washer).

All bugs are mine.

sfink · 2025-11-25T02:59:29 1764039569

What kind of kitchen are you working in where bugs are a concern??!

bagacrap · 2025-11-25T05:50:26 1764049826

You must work on very simple codebases

ChrisMarshallNY · 2025-11-25T10:11:49 1764065509

Yes and no.

I tend to work alone, so my scope is limited.

Some of the stuff I work on is quite involved, anyway.

I’ve been at this game awhile (coding for over 40 years), so I have learned a few tricks.

Of course, I “cheat.” I’ve learned to write software that doesn’t tend to have that many bugs, and I also don’t have to deal with other people’s code, so much. I write code for myself, which means that I don’t get to practice my debugging, so much, these days.

You can see for yourself. Much of my work is open-source, or source-available: https://github.com/ChrisMarshallNY

brightball · 2025-11-24T13:10:50 1763989850

In my experience, the vast majority of bugs are quick fixes that are easy to isolate or potentially even have a stack trace associated with them.

There will always be those “only happens on the 3rd Tuesday every 6 months” issues that are more complicated but…if you can get all the small stuff out of the way it’s much easier to dedicate some time to the more complicated ones.

Maximizing the value of time is the real key to focusing on quicker fixes. If nobody can make a case why one is more important than other, then the best use of your time is the fastest fix.

sshine · 2025-11-24T09:33:58 1763976838

> unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one

Race conditions in 3rd party services during / affected by very long builds and with poor metrics and almost no documentation. They only show up sometimes, and you have to wait for it to reoccur. Add to this a domain you’re not familiar with, and your ability to debug needs to be established first.

Stack two or three of these on top of each other and you have days of figuring out what’s going on, mostly waiting for builds, speculating how to improve debug output.

After resolving, don’t write any integration tests that might catch regressions, because you already spent enough time fixing it, and this needs to get replaced soon anyway (timeline: unknown).

chii · 2025-11-24T03:44:29 1763955869

I find most bugs take less time to fix than it takes time to verify and reproduce.

wahnfrieden · 2025-11-24T04:09:40 1763957380

LLMs have helped me here the most. Adding copious detailed logging across the app on demand, then inspecting the logs to figure out the bug and even how to reproduce it.

bluGill · 2025-11-24T15:07:19 1763996839

I did that once: logging ended up taking 80% of the CPU leaving not enough overhead for everything else the system should do. Now I am more careful to figure out what is worth logging at all, and also to make sure disabled logs are quickly bypassed.
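One way to make disabled logs cheap, sketched here with Python's standard `logging` module (the comment doesn't say what language or logging library was involved). The `expensive_dump` helper and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("demo.hot_path")
logger.setLevel(logging.WARNING)          # DEBUG is disabled by default
logger.addHandler(logging.NullHandler())  # keep the demo silent

calls = {"expensive": 0}


def expensive_dump():
    """Stand-in for costly work done only to build a log message."""
    calls["expensive"] += 1
    return "big state dump"


def process(item):
    # Guard first: when DEBUG is off, the dump is never computed at all.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("state for %r: %s", item, expensive_dump())
    return item * 2


process(21)
print("dumps computed with DEBUG off:", calls["expensive"])
logger.setLevel(logging.DEBUG)
process(21)
print("dumps computed with DEBUG on: ", calls["expensive"])
```

Passing arguments with `%s` placeholders instead of pre-formatting the string also defers the formatting itself, but only the explicit guard skips the work of computing the arguments.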

wahnfrieden · 2025-11-24T19:27:09 1764012429

You misunderstand: I remove the logging as soon as the task is done. I definitely do not keep the LLM logging around.

That's the beauty of it - it's able to add and remove huge amounts of logging per task, so I never need to manage the scale and complexity of logging that outlasts the task it was purposefully added for. With typical development, adding logging takes time so we keep it around and maintain it.

bluGill · 2025-11-24T19:30:53 1764012653

One of my needs is when something breaks in the real world I can figure out why. Bugs that happen at my desk I do what you said, add the logs I need and then delete them when it is fixed. However often there are things that I can't figure out how to reproduce at my desk and so I need logs that are always running on the off chance a new bug happens that I need to debug.

wahnfrieden · 2025-11-24T19:33:01 1764012781

Yea that's valid. I do keep some kinds of logs around for this. But I'm selective with it and most logs I don't need to retain to manage this risk.

dylan604 · 2025-11-24T17:31:30 1764005490

we've gotten into adding verbosity levels in logging where each logged event comes with an assigned level that only makes it to the log if it matches the requested log level. there are times when a full verbose output is just too damn much for day-to-day debugging, but is helpful when debugging the one feature.

i used to think options like -vvv or -loglevel panic were just someone being funny, but they do work when necessary. -loglevel sane, -loglevel unsane, -loglevel insane would be my take but am aware that most people would roll their eyes so we're lame using ERROR, WARNING, INFO, VERBOSE
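A minimal sketch of that scheme with Python's standard `logging` module, offered as one possible shape rather than the commenter's actual setup. Python has no built-in VERBOSE level, so the value 15 (between DEBUG and INFO) is an assumption, and `ListHandler` is a made-up helper that collects output for inspection.

```python
import logging

VERBOSE = 15  # ASSUMED custom level between DEBUG (10) and INFO (20)
logging.addLevelName(VERBOSE, "VERBOSE")


class ListHandler(logging.Handler):
    """Collects formatted messages in a list instead of writing them out."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


def make_logger(name, level):
    logger = logging.getLogger(name)
    logger.setLevel(level)     # events below this level are dropped
    logger.propagate = False   # keep the demo isolated from the root logger
    handler = ListHandler()
    logger.handlers = [handler]
    return logger, handler


def emit_all(logger):
    logger.error("disk full")
    logger.warning("disk almost full")
    logger.info("wrote 3 files")
    logger.log(VERBOSE, "wrote file a.txt (412 bytes)")


quiet, quiet_out = make_logger("demo.quiet", logging.WARNING)
chatty, chatty_out = make_logger("demo.chatty", VERBOSE)
emit_all(quiet)
emit_all(chatty)
print(quiet_out.messages)
print(chatty_out.messages)
```

The requested level acts as a threshold: the quiet logger keeps only ERROR and WARNING, while the chatty one keeps all four events.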

wahnfrieden · 2025-11-24T19:28:51 1764012531

That's great when you have to maintain a large amount of logs for weeks, months, years.

But I'm talking about adding and removing logs per dev task. There's really no need to have sophisticated log levels and maintaining them as the app evolves and grows, because the LLM can "instantly" add and remove the logging it needs per granular task. This is much faster for me than maintaining logs and carefully selecting log levels and managing how logs can be filtered. That only made sense to me when it took actual dev effort to add or remove these logs.

bluGill · 2025-11-24T18:41:24 1764009684

On smaller projects that works. We have a complex system where individual logs can get the log level changed. Though this turns out too fine grained. I'm moving to every subsystem being controllable, but not the individual logs. I'm still not sure what the right answer is though - it always seems like there are 10,000 lines of unrelated useless logs to wade through before finding the useful one, but anytime I remove something that turns out to be the needed log for the very next bug report...
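As a rough illustration of "every subsystem controllable, but not the individual logs", Python's standard `logging` module gets this from dotted logger names: a child logger with no level of its own inherits the level of its nearest configured ancestor. The subsystem names (`app.net`, `app.db`) and the helper `set_subsystem_level` are invented for this sketch; the system described above is not necessarily built this way.

```python
import logging


def set_subsystem_level(subsystem: str, level: int) -> None:
    """Set the threshold for one subsystem; loggers beneath it inherit it."""
    logging.getLogger(subsystem).setLevel(level)


# Quiet default for the whole (hypothetical) application tree.
set_subsystem_level("app", logging.WARNING)

# Turn up only the networking subsystem while chasing a bug there.
set_subsystem_level("app.net", logging.DEBUG)

# Individual modules never set their own level; they just pick a name.
net_log = logging.getLogger("app.net.socket")
db_log = logging.getLogger("app.db.query")

net_debug_enabled = net_log.isEnabledFor(logging.DEBUG)  # inherits from app.net
db_debug_enabled = db_log.isEnabledFor(logging.DEBUG)    # inherits from app
```

The knob count then scales with the number of subsystems rather than the number of log statements.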

1718627440 · 2025-11-24T21:32:05 1764019925

Use something like syslog, where everything is recorded and you can filter on display by subsystem and loglevel.

ChrisMarshallNY · 2025-11-24T05:31:24 1763962284

Yes. I often just copy the whole core dump, and feed it into the prompt.

criddell · 2025-11-24T14:49:19 1763995759

This is something that I've been trying to improve at. I work on a Windows application and so I get crash dumps that I open with WinDbg and then I usually start looking for exceptions.

Is this something an LLM could help with? What exactly do you mean when you say you feed a dump to the prompt?

ChrisMarshallNY · 2025-11-24T15:43:45 1763999025

I literally copy the whole stack dump from the log, and paste it into the LLM (I find that ChatGPT does a better job than Claude), along with something along the lines of:

> I am getting occasional crashes on my iOS 17 or above UIKit program. Given the following stack trace, what problem do you think it might be?

I will attach the source file, if I think I know the general area, along with any symptoms and steps to reproduce. One of the nice things about an LLM, is that it's difficult to overwhelm with too much information (unlike people).

It will usually respond with a fairly detailed analysis. Usually, it has some good ideas to use as starting points.

I don't think "I have a bug. Please fix it." would work, though. It's likely to try, but caveat emptor.

int_19h · 2025-11-25T04:16:26 1764044186

I kinda wonder if at some point this is something we might use the LLM more directly for. As in, train them on raw binary dumps as input.

ChrisMarshallNY · 2025-11-25T17:34:10 1764092050

I wonder if we’ll be seeing tools that do this.

I could see Apple or Microsoft, building it into their IDEs.

But, as was noted elsewhere, I think it’s only useful as an advisor. I think a lot of folks look at LLMs as some kind of programmer replacement.

wahnfrieden · 2025-11-25T17:59:04 1764093544

They are that too

ChrisMarshallNY · 2025-11-25T20:06:32 1764101192

I still wouldn't trust them for a lot of stuff.

Some of the code I get from Claude and ChatGPT is ... not so good.

int_19h · 2025-11-26T17:00:17 1764176417

It's like an intern that is incapable of learning. But a very enthusiastic one.

wahnfrieden · 2025-11-25T22:11:45 1764108705

I review it and I sometimes have it retry the same task 40+ times

Lionga · 2025-11-24T07:01:38 1763967698

And this, kids, is how one bug got fixed and two more were created

Sohcahtoa82 · 2025-11-24T17:58:11 1764007091

There's a huge difference between using an LLM to assist you versus letting it just do all the work for you. Your implication that they're the same, and that the previous commenter let the LLM do the work, is lazy.

ChrisMarshallNY only said they fed the dump into the LLM. They said nothing about using the LLM to write the fix.

ChrisMarshallNY · 2025-11-24T10:05:05 1763978705

Nope.

Good result == LLM + Experience.

The LLM just reduces the overhead.

That’s really what every “new paradigm” has ever done.

enraged_camel · 2025-11-24T14:19:35 1763993975

Also, robust test coverage helps prevent regressions.

beberlei · 2025-11-24T10:29:24 1763980164

It's odd at first, but springs from economic principles, mainly sunk cost fallacy.

If you invest 2 days of work and did not find the root cause of a bug, then you have the human desire to keep investing more work, because you already invested so much work. At that point however it's best to re-evaluate and do something different instead, because it might have a bigger impact.

After 2 days of not finding the problem, the likelihood that you won't find it in another 2 days either is higher than if you start over with another bug, where on average you find the problem earlier.

lan321 · 2025-11-24T12:52:35 1763988755

This sounds incorrect. You didn't find it but you're gaining domain knowledge and excluding options, hopefully narrowing down the cause. It's not like you're just chucking random garbage at Jenkins.

Of course, if it's a difficult bug and you can just say 'fuck it' and bury it in the backlog forever that's fine, but in my experience the very complex ones don't get discovered or worked on at all unless it's absolutely critical or a customer complains.

pjc50 · 2025-11-24T10:00:44 1763978444

I think the worst case I encountered was something like two years from first customer report to even fully confirming the bug, followed by about a month of increasingly detailed investigations, a robot, and an oscilloscope.

The initial description? "Touchscreen sometimes misses button presses".

ChrisMarshallNY · 2025-11-24T10:31:56 1763980316

Thanks.

I love hearing stories like this.

pjc50 · 2025-11-24T11:05:05 1763982305

I'm no Raymond Chen, but sometimes I wish I'd kept notes on interesting bugs that I could take with me when I moved jobs. I've often been the go-to guy for weird shit that is happening that nobody else understands and requires cross-disciplinary insight.

Other favourites include "Microsoft Structured Exception Handling sometimes doesn't catch segfaults", and "any two of these network devices work together but all three combined freak out".

oldestofsports · 2025-11-24T20:16:58 1764015418

> It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

I understood it as the whole point of the 2 day hard limit - you start working on a bug that turns out to be bigger than expected, so you write down your findings and move on to the next one.

peepee1982 · 2025-11-24T14:55:54 1763996154

Yep. Also, sometimes you figure out a bug and in the process you find a whole bunch of new ones that the first bug just never let surface.

thfuran · 2025-11-24T16:07:28 1764000448

>I can’t imagine spending more than a day on one.

You mean starting after it has been properly tracked down? It can often take a whole lot of time to go from "this behavior is incorrect sometimes" to "and here's what needs to change".

ChrisMarshallNY · 2025-11-24T16:15:16 1764000916

Depends. If it takes a long time to track down, then it should either be sidelined, or the design needs to be revisited.

I have found that really deep bugs are the result of bad design, on my part, and applying "band-aid" fixes often just kicks the can down the road, for a reckoning (that is now just a bit worse), later.

If it is not super-serious (small performance issues, for instance; which can involve moving a lot of cheese), I can often schedule a design review for a time when it's less critical, and maybe set up an exploration branch.

People keep bringing up threading and race conditions, which are legitimately nasty bugs.

In my experience, they are often the result of bad design, on my part. It's been my experience that "thread everything" can be a recipe for disaster. The OS/SDK will often do internal threading, and I can actually make things worse, by running my own threads.

I try to design stuff that will work fine, in any thread, which gives me the option to sequester it into a new thread, at a later time (I just did exactly that, a few days ago, in a Watch app), but don't immediately do that.

bagacrap · 2025-11-24T16:33:26 1764002006

> If it takes a long time to track down, then it should either be sidelined, or the design needs to be revisited.

I don't get this. Either you give up on the bug after a day, or you throw out the entire codebase and start over?

Sure, if the bug is low severity and I don't have a reproduction, I will ignore it. But there are bad bugs that are not understood and can take a lot more than a day to look into, such as by adding telemetry to help track it down.

Yes, it is usually the case that tracking it down is harder than fixing. But there are also cases where the larger system makes some broad assumptions which are not true, and fixing is tricky. It is not usually an option to throw out the entire system and start over each time this happens in a project.

ChrisMarshallNY · 2025-11-24T17:18:39 1764004719

> you throw out the entire codebase and start over

Nah. That’s called “catastrophic thinking.” This is why it’s important (in my experience) to back off, and calm down.

I’ll usually find a way to manage a smaller part of the codebase.

If I make decisions when I’m stressed, Bad Things Happen.

Uehreka · 2025-11-24T04:37:55 1763959075

> It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

In my experience there are two types of low-priority bugs (high-priority bugs just have to be fixed immediately no matter how easy or hard they are).

1. The kind where I facepalm and go “yup, I know exactly what that is”, though sometimes it’s too low of a priority to do it right now, and it ends up sitting on the backlog forever. This is the kind of bug the author wants to sweep for, they can often be wiped out in big batches by temporarily making bug-hunting the priority every once in a while.

2. The kind where I go “Hmm, that’s weird, that really shouldn’t happen.” These can be easy and turn into a facepalm after an hour of searching, or they can turn out to be brain-broiling heisenbugs that eat up tons of time, and it’s difficult to figure out which. If you wipe out a ton of category 1 bugs then trying to sift through this category for easy wins can be a good use of time.

And yeah, sometimes a category 1 bug turns out to be category 2, but that’s pretty unusual. This is definitely an area where the perfect is the enemy of the good, and I find this mental model to be pretty good.

tonyedgecombe · 2025-11-24T08:44:30 1763973870

>high-priority bugs just have to be fixed immediately no matter how easy or hard they are

The fact that something is high priority doesn't make it less work.

ChrisMarshallNY · 2025-11-24T11:45:56 1763984756

Or more.

I often find the nastiest bugs are the quickest fixes.

I have a "zero-crash" policy. Crashes are never acceptable.

It's easy to enforce, because crashes are usually easy to find and fix.

$> ThreadingProblems has entered the chat

lkbm · 2025-11-24T22:55:37 1764024937

A big reason we did a "fix week" at my old job was to deal with all the simple, low priority issues. Sure, there were high severity bugs, but they would get prioritized during normal work, whereas fix week was to prevent death of a thousand cuts. Kinda trivial things that just accumulate and make the site look and feel janky.

Some things turn out to be surprisingly complex, but you can very often know that the simple thing is simple.

JJMcJ · 2025-11-24T15:59:07 1763999947

It's like remodeling. The drywall comes down. Do you just put up a new sheet or do you need to reframe one wall of the house?

claw-el · 2025-11-24T18:14:07 1764008047

> Also, I tend to attack bugs by priority/severity, as opposed to difficulty.

This is one part that is rarely properly implemented. We have our bug bash days too, but I noticed after the fact that maybe 1/3 of the bugs we solved were on a feature we are thinking of deprecating soon due to low usage.

How can we attack bugs better by priority?

huherto · 2025-11-24T14:33:16 1763994796

I do agree that you should be able to fix most bugs in 2 days or less. If you have many bugs taking longer to fix, it may be an indication that you may have systemic issues. (e.g design, architectural, tooling, environment access, test infrastructure, etc)

bluGill · 2025-11-24T15:05:11 1763996711

Sure, but you never know if this next bug is another fix-it-in-1-hour, or if it will take months to figure out. I have had a few "The is not spelled 'Teh'" bugs where it takes longer to find the code in question with grep than to fix it, but most are not that obvious, and so you don't know if there are 2 hours left or not until 2 hours later, when you know you found something or are still looking. (or unless you think you fixed it and the time to verify the test is about 2 hours, but then only if your fix worked)

AbstractH24 · 2025-11-24T14:52:31 1763995951

> Is odd. It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

Learning how to better estimate how long tasks take is one of my biggest goals. And one I've yet to even figure out how to master

SergeAx · 2025-11-30T03:36:45 1764473805

> It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

I think what they mean is that after 2 days of working on a bug you stop regardless of the result, leaving a paper trail behind for the next person.

mobeigi · 2025-11-24T11:08:15 1763982495

I believe the idea is to pick small items that you'd likely be able to solve quickly. You don't know for sure but you can usually take a good guess at which tasks are quick.

michaelbuckbee · 2025-11-24T17:22:03 1764004923

Something I often find are "categorical" bugs where it's really 3 or 4 different bugs in a trench coat all presenting as a single issue.

dockd · 2025-11-24T16:35:24 1764002124

How is this for a rule of thumb: the time it takes to fix a bug is directly related to the age of the software.

ChrisMarshallNY · 2025-11-24T19:01:39 1764010899

That's also a "It Depends™" thing.

Really old software can be referred to as "Mature," as opposed to "Decrepit." It can be extremely well-documented, and well-understood. Many times, there are tools that grow up, alongside the main code.

I wrote stuff that was still in use, 25 years later, because the folks that took it over, did a really good job of maintaining it.

j45 · 2025-11-24T05:12:49 1763961169

Bugs taking less than 2 days are great to have as a target but will not be something that can be guaranteed.

RossBencina · 2025-11-24T05:54:23 1763963663

Next up: a new programming language or methodology that guarantees all bugs take less than two days to fix.

wiredfool · 2025-11-24T20:13:21 1764015201

I think this, like many problems, can be refactored into the halting problem. Which we know how to solve…. Right?

Viliam1234 · 2025-11-25T22:01:08 1764108068

We definitely know how to create a Jira ticket for it, and the rest is developer's problem.

cvoss · 2025-11-24T22:43:31 1764024211

The article addresses your concerns directly.

> In one of our early fixits, someone picked up what looked like a straightforward bug. It should have been a few hours, maybe half a day. But it turned into a rabbit hole. Dependencies on other systems, unexpected edge cases, code that hadn’t been touched in years.

> They spent the entire fixit week on it. And then the entire week after fixit trying to finish it. What started as a bug fix turned into a mini project. The work was valuable! But they missed the whole point of a fixit. No closing bugs throughout the week. No momentum. No dopamine hits from shipping fixes. Just one long slog.

> That’s why we have the 2-day hard limit now. If something is ballooning, cut your losses. File a proper bug, move it to the backlog, pick something else. The limit isn’t about the work being worthless - it’s about keeping fixit feeling like fixit.

yxhuvud · 2025-11-24T13:44:29 1763991869

I've seen people spending 4 months on a hard to replicate segfault.

w0m · 2025-11-24T13:54:36 1763992476

> That said, unless fixing a bug requires a significant refactor/rewrite, I can’t imagine spending more than a day on one.

oh sweet sweet summer child...

ahoka · 2025-11-24T10:23:06 1763979786

Not sure why you would ever need to refactor for fixing a bug?

ChrisMarshallNY · 2025-11-24T10:31:24 1763980284

Oh, that's because a bug in requirements or specification is usually a killer.

I have encountered areas where the basic design was wrong (often comes from rushing in, before taking the time to think things through, all the way).

In these cases, we can either kludge a patch, or go back and make sure the design is fixed.

The longer I've been working, the less often I need to go back and fix a busted design.

nemetroid · 2025-11-24T21:05:03 1764018303

A nice way to fix bugs is to make the buggy state impossible to represent. In cases where a bug was caused by some fundamental flaw in the original design, a redesign might be the only way to feel reasonably confident about the fix.
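A small sketch of what "impossible to represent" can look like, in Python for brevity. The connection example and all names here (`Disconnected`, `Connected`, `Failed`, `describe`) are hypothetical. Instead of one record with a `connected` flag plus optional `session_id` and `error` fields (which allows nonsense like "connected, no session, and also an error"), each state is its own type carrying only the data that makes sense for it. Python only enforces this structurally and via a type checker; languages with real sum types enforce it at compile time.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Disconnected:
    """No session and no error: there is nothing to store."""


@dataclass(frozen=True)
class Connected:
    """A session id exists only while connected."""
    session_id: str


@dataclass(frozen=True)
class Failed:
    """An error reason exists only after a failure."""
    reason: str


# The only states a connection can be in; no flag/field combinations to audit.
ConnectionState = Union[Disconnected, Connected, Failed]


def describe(state: ConnectionState) -> str:
    """Each branch can only touch the fields its own state actually has."""
    if isinstance(state, Connected):
        return f"connected (session {state.session_id})"
    if isinstance(state, Failed):
        return f"failed: {state.reason}"
    return "disconnected"
```

The redesign cost is that every caller which used to poke at the flags has to be rewritten against the new types, which is exactly why this kind of fix can blow past a short time box.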

mat0 · 2025-11-24T07:45:29 1763970329

you cannot know. that’s why the post elaborates saying (paraphrasing) “if you realize it’s taking longer, cut your losses and move on to something else”

jorvi · 2025-11-24T15:16:42 1763997402

Yeah, "no bug should take over 2 days" tells me you've never had a race condition in your codebase.

ChrisMarshallNY · 2025-11-24T15:52:19 1763999539

I'm sure that you're right. I'm likely a bad, inexperienced engineer. There's a lot of us, out here.

jorvi · 2025-11-25T01:01:10 1764032470

I'm sure your sarcasm is right. You're likely a good, godlike engineer that would fix even the most intractable race conditions within 48 hours. There's a lot of you, out there.

ChrisMarshallNY · 2025-11-25T01:40:57 1764034857

Have a great day!

triyambakam · 2025-11-24T03:50:08 1763956208

> It’s virtually impossible for me to estimate how long it will take to fix a bug, until the job is done.

Now I find that odd.

gyomu · 2025-11-24T03:52:40 1763956360

I don’t. I worked on firmware stuff where unexplainable behavior occurs; digging around the code, you start to feel like it’s going to take some serious work to even start to comprehend the root cause; and suddenly you find the one line of code that sets the wrong byte somewhere as a side effect, and what you thought would fill up your week ended up taking 2 hours.

And sometimes, the exact opposite happens.

kubb · 2025-11-24T11:52:43 1763985163

You might get humbled by overwhelming complexity one day. Enjoy the illusion of perfect insight until then.

triyambakam · 2025-11-25T00:45:26 1764031526

I didn't say it must be always correct

ChrisMarshallNY · 2025-11-24T03:51:57 1763956317

Yeah, I’m obviously a terrible programmer. Ya got me.

triyambakam · 2025-11-24T04:43:00 1763959380

I just find it so oversimplified that I can't believe you're sincere. Like you have entirely no internal heuristic for even a coarse estimation of a few minutes, hours, or days? I would say you're not being very introspective or are just exaggerating.

kimixa · 2025-11-24T05:20:08 1763961608

I think it's very sector dependent.

Working on drivers, a relatively recent example is when we started looking at a "small" image corruption issue in some really specific cases, that slowly spidered out to what was fundamentally a hardware bug affecting an entire class of possible situations, it was just this one case happened to be noticed first.

There was even talk about a hardware ECO at points during this, though an acceptable workaround was eventually found.

I could never have predicted that when I started working on it, and it seemed every time we thought we'd got a decent idea about what was happening even more was revealed.

And then there's been many other issues when you fall onto the cause pretty much instantly and a trivial fix can be completed and in testing faster than updating the bugtracker with an estimate.

True there's probably a decent amount, maybe even 50%, where you can probably have a decent guess after putting in some length of time and be correct within a factor of 2 or so, but I always felt the "long tail" was large enough to make that pretty damn inaccurate.

auggierose · 2025-11-24T10:35:06 1763980506

I can explain it to you. A bug description at the beginning is some observed behaviour that seems to be wrong. Now the process starts of UNDERSTANDING the bug. Once that process has concluded, it will be possible to make a rough guess of how long fixing it will take. Very often, the answer then is a minute or two, unless major rewrites are necessary. So, the problem is you cannot put an upfront bound on how long you need to understand the bug. Understanding can be a long winded process that includes trying to fix the bug in the process.

darkwater · 2025-11-24T10:55:16 1763981716

> A bug description at the beginning is some observed behaviour that seems to be wrong.

Or not. A bug description can also be a ticket from a fellow engineer who knows the problem space deeply and has an initial understanding of the bug, likely cause and possible problems. As always, it depends, and IME the kind of bugs that end up in those "bugathons" are the annoying "yeah I know about it, we need to fix it at some point because it's PITA".

auggierose · 2025-11-24T11:52:31 1763985151

That just means that somebody else has already started the process of understanding the bug, without finishing it. So what?

darkwater · 2025-11-24T12:24:36 1763987076

So you can know before starting to work on the ticket if it's a few minutes boring job, if it could take hours or days or if it's going to be something bigger.

I can understand the "I don't do estimates" mantra for bigger projects, but ballpark estimations for bugs - even if you can be wrong in the end - should not be labelled as 100% impossible all the time.

auggierose · 2025-11-24T12:42:21 1763988141

Why did the other developer who passed you the bug not make an estimate then?

I understand the urge to quantify something that is impossible to quantify beforehand. There is nothing wrong with making a guess, but people who don't understand my argument usually also don't understand the meaning of "guess". A guess is something based on my current understanding, and as that may change substantially, my guess may also change substantially.

I can make a guess right now on any bug I will ever encounter, based on my past experience: It will not take me more than a day to fix it. Happy?

com2kid · 2025-11-24T04:57:32 1763960252

My team once encountered a bug that was due to a supplier misstating the delay timing needed for a memory chip.

The timings we had in place worked, for most chips, but they failed for a small % of chips in the field. The failure was always exactly identical, the same memory address got corrupted, so it looked exactly like an invalid pointer access.

It took multiple engineers months of investigating to finally track down the root cause.

triyambakam · 2025-11-24T05:32:12 1763962332

But what was the original estimate? And even so I'm not saying it must be completely and always correct. I'm saying it seems wild to have no starting point, to simply give up.

com2kid · 2025-11-24T06:28:56 1763965736

Have you ever fixed random memory corruption in an OS without memory protection?

Best case you trap on memory access to an address if your debugger supports it (ours didn't). Worst case you go through every pointer that is known to access nearby memory and go over the code very very carefully.

Of course it doesn't have to be a nearby pointer, it can be any pointer anywhere in the code base causing the problem, you just hope it is a nearby pointer because the alternative is a needle in a haystack.

I forget how we did find the root cause, I think someone may have just guessed bit flip in a pointer (vs overrun) and then un-bit-flipped every one of the possible bits one by one (not that many, only a few MB of memory so not many active bits for pointers...) and seen what was nearby (figuring what the originally intended address of the pointer was) and started investigating what pointer it was originally supposed to be.

Then after confirming it was a bit flip you have to figure out why the hell a subset of your devices are reliably seeing the exact same bit flipped, once every few days.

So to answer your question, you get a bug (memory is being corrupted), you do an initial investigation, and then provide an estimate. That estimate can very well be "no way to tell".

The principal engineer on this particular project (Microsoft Band) had a strict 0 user impacting bugs rule. Accordingly, after one of my guys spent a couple weeks investigating, the principal engineer assigned one of the top firmware engineers in the world to track down this one bug and fix it. It took over a month.
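For the curious, the "un-bit-flip every possible bit" search described above can be sketched roughly like this. This is a reconstruction of the general idea only, not the code that was used; the region table, the addresses, and the function name `plausible_originals` are all made up. It leans on everything being statically allocated, so the set of legitimate targets is known up front.

```python
from typing import Dict, List, Tuple

# Hypothetical map of statically allocated regions: name -> [start, end).
Regions = Dict[str, Tuple[int, int]]


def plausible_originals(
    corrupted: int, regions: Regions, width: int = 32
) -> List[Tuple[int, int, str]]:
    """Flip each bit of the corrupted pointer back, one at a time, and keep
    the candidates that land inside a known region.

    Returns (bit_index, candidate_address, region_name) tuples.
    """
    hits = []
    for bit in range(width):
        candidate = corrupted ^ (1 << bit)  # undo a single-bit flip
        for name, (start, end) in regions.items():
            if start <= candidate < end:
                hits.append((bit, candidate, name))
    return hits


# Made-up example: a pointer into rx_buffer with bit 20 flipped.
example_regions = {"rx_buffer": (0x2000_0000, 0x2000_0100)}
example_hits = plausible_originals(0x2010_0040, example_regions)
```

Each hit is a guess at which pointer was intended and which bit went bad; if the same bit index keeps showing up across devices, that points away from a software overrun.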

snovv_crash · 2025-11-24T07:46:32 1763970392

This is why a test suite and mock application running on the host is so important. Tools like valgrind can be used to validate that you won't have any memory errors once you deploy to the platform that doesn't have protections against invalid accesses.

It wouldn't have caught your issue in this case. But it would have eliminated a huge part of the search space your embedded engineers had to explore while hunting down the bug.

com2kid · 2025-11-24T16:54:58 1764003298

Custom OS, cross compiling from Windows, using Arm's old C compiler so tools like valgrind weren't available to us.

Since it was embedded, no malloc. Everything being static allocations made the search possible in the first place.

This wasn't the only HW bug we found, ugh.

paulf38 · 2025-11-25T05:44:10 1764049450

Valgrind (and the sanitizers) are only as good as your test coverage.

Static analysis can cover all your code, though generally with a significant rate of false positives that you will need to analyse.

pyrale · 2025-11-24T08:29:29 1763972969

There is a divide in this job between people who can always provide an estimate but accept that it is sometimes wrong, and people who would prefer not to give an estimate because they know it’s more guess than analysis.

You seem to be in the first club, and the other poster in the second.

arethuza · 2025-11-24T09:23:39 1763976219

It rather depends on the environment in which you are working - if estimates are, well, estimates then there is probably little harm in guessing how long something might take to fix. However, some places treat "estimates" as binding commitments and then it could be risky to make any kind of guess because someone will hold you to it.

ChrisMarshallNY · 2025-11-24T12:52:57 1763988777

More than some places. Every place I've worked, has been a place where you estimate at your own peril. Even when the manager says "Don't worry. I won't hold you to it. Just give me a ballpark.", you are screwed.

I used to work for a Japanese company. When we'd have review meetings, each manager would have a small notebook on the table, in front of them.

Whenever a date was mentioned, they'd quickly write something down.

Those dates were never forgotten.

arethuza · 2025-11-24T13:42:17 1763991737

"Don't worry. I won't hold you to it. Just give me a ballpark."

Anytime someone says that you absolutely know they will treat whatever you say as being a commitment written in blood!