It’s actually quite a lot worse than even doctors in training except for highly constrained experimental settings and a few very nice applications that are mostly too tedious/impractical for a human to do or are very basic detection tasks.
I am a radiologist and researcher predominately focused on AI.
I work with pathologists, and radiology is way ahead of us with AI use in clinical settings (but still not very far). The only things that get serious use are lab-developed (i.e. not commercial) image analysis algorithms for very limited (tedious, error-prone and ultimately not that often used) biomarkers. Don't believe the hype.
You could also look at the market: one of the biggest players, Paige, was acquired for about 30% of the money they raised.
I don’t think so, not beyond the current trend in medicine which is going up anyway.
For some things, like 3D volume segmentation of structure or disease (e.g. CVA/stroke volume, cardiac muscle mass, iron quantification) the bottleneck is the time it takes so we currently use approximations like single longest dimension, circular regions of interest, etc. AI will dramatically increase accuracy allowing for more accurate treatment and easier large scale research with quantitative endpoints.
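To make the gap between a full segmentation and a single-dimension approximation concrete, here is a minimal sketch on a synthetic ellipsoid "lesion". Everything in it is an assumption for illustration: the lesion shape, the 1 mm isotropic voxel spacing, the function names, and the choice of a sphere built on the longest dimension as the stand-in approximation. It is not how any particular scanner or product computes volume.

```python
import math

def ellipsoid_mask(shape, center, radii_mm, spacing_mm):
    """Build a synthetic binary mask (nested lists, z/y/x order) of an ellipsoid.

    This stands in for the output of a 3D segmentation; it is not real data.
    """
    nz, ny, nx = shape
    cz, cy, cx = center
    rz, ry, rx = radii_mm
    sz, sy, sx = spacing_mm
    mask = []
    for z in range(nz):
        plane = []
        for y in range(ny):
            row = []
            for x in range(nx):
                # Normalised squared distance from the centre, in mm.
                d = (((z - cz) * sz / rz) ** 2
                     + ((y - cy) * sy / ry) ** 2
                     + ((x - cx) * sx / rx) ** 2)
                row.append(1 if d <= 1.0 else 0)
            plane.append(row)
        mask.append(plane)
    return mask

def segmented_volume_ml(mask, spacing_mm):
    """Volume from voxel counting: number of voxels times voxel volume."""
    voxel_mm3 = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    n_voxels = sum(v for plane in mask for row in plane for v in row)
    return n_voxels * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3

def sphere_volume_from_longest_dimension_ml(longest_mm):
    """Approximation: treat the lesion as a sphere with the longest dimension as diameter."""
    radius = longest_mm / 2.0
    return 4.0 / 3.0 * math.pi * radius ** 3 / 1000.0

# Hypothetical lesion: semi-axes of 5, 10 and 20 mm on a 1 mm isotropic grid.
spacing = (1.0, 1.0, 1.0)
radii = (5.0, 10.0, 20.0)
mask = ellipsoid_mask(shape=(13, 25, 45), center=(6, 12, 22),
                      radii_mm=radii, spacing_mm=spacing)

seg_ml = segmented_volume_ml(mask, spacing)
approx_ml = sphere_volume_from_longest_dimension_ml(longest_mm=2 * max(radii))
analytic_ml = 4.0 / 3.0 * math.pi * radii[0] * radii[1] * radii[2] / 1000.0

print(f"voxel-counted volume:      {seg_ml:.2f} mL")
print(f"analytic ellipsoid volume: {analytic_ml:.2f} mL")
print(f"longest-dimension sphere:  {approx_ml:.2f} mL")
```

For an elongated shape like this one, the single-dimension estimate overshoots the voxel-counted volume several times over, which is the kind of error a fast automatic segmentation would remove.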
Other things people think of, like detection of aneurysms, fractures and lung nodules, are not “hard”, but AI has already added and will continue to add the second-reader benefit, which will reduce detection errors. For this category the clinical benefit is as yet unclear: we know that increased detection does not necessarily translate into improved patient outcomes and can in fact make them worse through over-diagnosis, which means investigation-related harms and over-treatment.
We were already in a phase of “over detection” in much of radiology with advances in imaging technology so the incremental benefit of current AI remains to be seen and I personally think is going to be much more limited. I had a case recently where a 2 mm brain aneurysm was missed on 3 CT scans over 10 years but was picked up by AI so now is being followed annually. This is too small to treat considering the risks and a serious argument could be made that 10 years of stability is proof enough that this is almost certainly clinically irrelevant for this patient.
Far more interesting areas of AI in imaging are acceleration of acquisition (i.e. the medical equivalent of upscaling), which can dramatically decrease costs and increase accessibility, and analysis of imperceptible features.
It may not be a popular take here but in my opinion the future of radiology is like what we see in software engineering today - a skilled human equipped with AI will outperform humans without AI and AI without humans, the latter of which we are still several years away from prototyping due to various technical hurdles.
> in my opinion the future of radiology is like what we see in software engineering today - a skilled human equipped with AI will outperform humans without AI and AI without humans
I suspect this will be the case across the board. It's a useful tool, but it's just a tool. It's not a replacement.
A friend of mine, a dermatologist, told me that LLMs are quite performant for melanoma analysis. Based on their own statistics, LLMs are able to beat humans with ~10 years of experience in the field.
They will never beat the human instinct tho, but they can be great tools sometimes. Unfortunately, LLMs mostly produce garbage.
Whenever it comes to medical diagnosis I would caution anyone to be careful with what “beat humans” really means.
In real life pathology is a spectrum, not a binary, and physicians are not trained to be 100% accurate. Instead they optimize sensitivity and specificity, considering pretest probability as well as the harms of overdiagnosis and underdiagnosis for a given scenario.
For something like melanoma, which is relatively easy to diagnose with a superficial, extremely low-risk skin biopsy and where early staging dramatically improves outcomes, you would want to design around overcalling (high sensitivity) rather than maximizing accuracy, given the significant harms of false negatives and the minimal harms of false positives.
An AI may be more accurate at classifying melanoma/not melanoma, but if it does not meaningfully improve on the clinical threshold of biopsy/no biopsy or result in fewer biopsies, that accuracy is wasted and may even be detrimental.
Note: I am just using this as an example to illustrate the considerations.
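The accuracy versus sensitivity trade-off above can be sketched with a toy calculation. The scores and labels below are entirely made up, the "biopsy if score >= threshold" rule is a simplifying assumption, and the 0.95 sensitivity target is an arbitrary illustrative number, not a clinical standard.

```python
# Synthetic, made-up cases: (model score, truly melanoma?). Not real data.
CASES = [(0.9, True), (0.8, True), (0.6, True), (0.3, True)] + [
    (score, False)
    for score in (0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3,
                  0.35, 0.35, 0.4, 0.4, 0.45, 0.5, 0.55, 0.58)
]

def evaluate(threshold, cases=CASES):
    """Apply the assumed rule 'biopsy if score >= threshold' and tally outcomes."""
    tp = sum(1 for s, m in cases if m and s >= threshold)
    fn = sum(1 for s, m in cases if m and s < threshold)
    fp = sum(1 for s, m in cases if not m and s >= threshold)
    tn = sum(1 for s, m in cases if not m and s < threshold)
    return {
        "threshold": threshold,
        "accuracy": (tp + tn) / len(cases),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "biopsies": tp + fp,   # everyone called positive gets a biopsy
        "missed": fn,          # melanomas that would not be biopsied
    }

# Try every observed score as a candidate threshold.
results = [evaluate(t) for t in sorted({s for s, _ in CASES})]

# Operating point 1: whatever maximises raw accuracy.
most_accurate = max(results, key=lambda r: r["accuracy"])

# Operating point 2: the highest threshold that still meets a sensitivity target.
high_sensitivity = max(
    (r for r in results if r["sensitivity"] >= 0.95),
    key=lambda r: r["threshold"],
)

for name, r in (("max accuracy", most_accurate),
                ("high sensitivity", high_sensitivity)):
    print(f"{name:>16}: threshold={r['threshold']:.2f} "
          f"accuracy={r['accuracy']:.2f} sensitivity={r['sensitivity']:.2f} "
          f"biopsies={r['biopsies']} missed={r['missed']}")
```

On this toy data the accuracy-maximising threshold misses one melanoma, while the high-sensitivity threshold misses none at the cost of more biopsies and much lower accuracy. Which operating point is "better" depends on the relative harms, not on the accuracy number.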
The better question is: are there any sources showing that AI is better than human readers? I haven’t heard anyone make this claim outside of single/few disease classification tasks, and even those are mostly 2D.
Anecdotally, my practice has most FDA-approved AI deployed as we are an evaluation site, and very rarely is the AI result useful. Over the past few months we have been cancelling contracts as these cost quite a lot of money (in some cases eating >50% of the study interpretation cost) for little to no benefit and a LOT of noise.
I think you’re overstating the impact of interpretability here. As you noted earlier, latent reasoning models can’t be trained very well, and discretization may be load-bearing rather than a readability tax. Together with significant inference infra hurdles (e.g. batching, speculative decoding), this has limited any serious attempts and reduced the theoretical advantage over CoT, at least in the near term.
> I think you’re overstating the impact of interpretability here
Outside of RLAIF, interpretability is the strongest way to do alignment right now. Alignment is important because otherwise LLMs are incentivized to learn power-seeking, dangerous behaviours [1]. A more down-to-earth example of alignment being important is that agents are incentivized to do tasks in the shortest way possible, and this way might not be what the user wants (I explain this further in another comment in this thread).
You’re putting the cart before the horse - alignment is an unsolved challenge (there are proposed approaches and active research on this) but it is still not established (beyond theory) that latent reasoning is more capable than CoT on hard language reasoning, particularly at scale.
I’m also reminded of the early COVID days when exponential growth was leading to predictions of the collapse of modern civilization and a billion dead; now it’s just another endemic respiratory virus.
Yeah! Just like they warned us that Y2K was gonna cause a lot of problems, and then a bunch of people did a bunch of work and then those problems didn't happen, so those people warning us about Y2K were wrong!
> The current average flight time from NYC to SF is 6.7 hours.
What's your source for this? I take this flight a lot and I find it hard to believe it's more than 5.5-5.75 on average. Looking at the last few weeks for one of them[0] supports my experience.
ChatGPT will actually look at your whole medical history, listen to you, think and check multiple different options before making a decision. You can spend hours chatting with it back and forth.
An average human doctor has maybe 15 minutes allotted to getting to know you, analyse and determine a course of action. Which is usually "take some ibuprofen and let's see if it goes away". Then you go again in two weeks with the same thing, it's a different doctor and the context has been reset unless you do an info dump from the previous visits and try not to forget anything.
And if you infodump too much or use actual medical diagnosis terms, the Dr gets defensive because you're stepping on THEIR area of expertise and will start pushing back even on the obvious just because they can.
I wonder if in your case (which is very common) the issue is a mismatch between expectations and reality. The medical system as we know it is not designed for someone to listen to you and do a back and forth for hours. If we did that we would only treat 2-4 patients a day. It’s also not particularly helpful.
Time spent in a medical encounter is tied to patient satisfaction, but there is a rapid drop-off in clinical benefit, especially in the current day, where investigations are more important than a physical exam in most cases and more important than history in a substantial portion.
15 minutes is what we book as follow-ups or minor assessments in US+Canada, usually sufficient for most things. New consults or complex patients are 30-60 minutes.
Infodumping is not particularly helpful. Doctors are trained to use a combination of open and closed questions to guide the encounter based on their thinking and understanding of medicine. What matters is the relevant past medical history, as not every symptom or past disease is necessarily useful in assessing what’s wrong today.
> has maybe 15 minutes allotted to getting to know you [...] Then you go again in two weeks with the same thing, it's a different doctor and the context has been reset
This is not how doctors work in most of the world. Not having an actual primary care physician that is able to keep track of each patient over multiple years means they are skipping out on one of their most important duties. You should advocate for a better standard of care rather than resorting to hallucinating chatbots.
> Is the blood pressure really being measured "correctly" in all those studies? Or not?
Probably incorrect in most studies, especially large population ones that influence treatment guidelines.
It’s academic and doesn’t practically matter though.
The pathogenesis of hypertension related disorders (kidney failure, heart failure, stroke etc) is well known.
It’s not in doubt that sustained hypertension is bad, that there is increased risk with higher blood pressure and that patients with high blood pressure undergoing treatment suffer fewer of these bad outcomes.
Potential harm is always the same - misdiagnosis and/or mismanagement.
It’s probably very low in the context of CGM and diabetes as the potentially harmful treatments require prescriptions.
Device prescription requirements are usually due to product labelling and the manufacturer's application. There are OTC fingerstick glucometers and CGMs approved.