## AI health chatbots look confident, but citations break down

On May 13, 2026, [Decrypt](https://decrypt.co/367689/half-ai-health-advice-wrong-seems-right) reported on a BMJ Open audit ([study](https://bmjopen.bmj.com/content/16/4/e112695?rss=1)) that tested five consumer chatbots on health questions. The paper, published on April 14, found that 49.6% of 250 responses were problematic, and none of the models produced a fully accurate reference list. That is the real warning sign: in health, a wrong answer is bad, but a wrong answer wrapped in confident language and shaky citations is worse.
The study covered Gemini, DeepSeek, Meta AI, ChatGPT and Grok. Each bot answered 50 prompts across cancer, vaccines, stem cells, nutrition and athletic performance, and two experts rated the outputs. The chatbots almost never refused to answer; only Meta AI declined, and only twice. That matters because many users do not approach these tools like research software. They use them like search engines, which means the answer has to be understandable, sourceable and easy to verify in the moment.
### The headline number is useful, but the failure mode matters more
The 49.6% figure is striking, yet it should not be read as a blanket verdict on every AI health question. The study used an adversarial framework, so it deliberately pushed the models toward misinformation-prone prompts. That makes the results useful as a stress test, not a casual average. It also explains why the weaker topics were nutrition, stem cells and athletic performance, while vaccines and cancer performed better.
The deeper issue is citation integrity. The audit found a median reference completeness score of 40%, and even the strongest outputs still failed to produce a fully accurate reference list. In practice, that means a user can get an answer that sounds well supported while the underlying sources are incomplete, misnamed or fabricated. For health, that is not a small UI defect. It is part of the risk surface.
- **Five chatbots were tested**: Gemini, DeepSeek, Meta AI, ChatGPT and Grok.
- The dataset covered 250 responses across five health topics.
- 49.6% of the answers were rated problematic.
- Reference completeness was only 40% at the median.
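To make the completeness figure concrete, the sketch below shows one way a per-answer reference score could be aggregated into a median like the 40% reported above. The rubric, field names and sample data are hypothetical illustrations, not the study's actual scoring method.

```python
from statistics import median

# Hypothetical rubric: a cited reference counts toward completeness if it
# exists, supports the claim it is attached to, and carries enough metadata
# (title, venue, year) to be located. The study's real rubric may differ.

def reference_score(ref: dict) -> float:
    """Fraction of rubric checks a single cited reference passes (0.0-1.0)."""
    checks = [ref["exists"], ref["matches_claim"], ref["has_metadata"]]
    return sum(checks) / len(checks)

def response_completeness(references: list) -> float:
    """Reference completeness of one chatbot answer, as a percentage."""
    if not references:
        return 0.0  # an answer that cites nothing scores zero
    return 100 * sum(reference_score(r) for r in references) / len(references)

# Toy data: three answers with made-up reference ratings.
answers = [
    [{"exists": True, "matches_claim": True, "has_metadata": False}],
    [{"exists": True, "matches_claim": False, "has_metadata": False},
     {"exists": False, "matches_claim": False, "has_metadata": False}],
    [],  # an answer that offered no references at all
]

per_answer = [response_completeness(refs) for refs in answers]
print(f"median reference completeness: {median(per_answer):.1f}%")
```

A median of 40% means half of the answers scored no better than that, which is another way of saying their reference lists failed most of the checks applied to them.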
## Why confidence is the dangerous part
A chatbot that sounds uncertain is easy to discount. A chatbot that speaks fluently, cites sources and keeps going is harder to challenge. That is why the readability finding matters almost as much as the accuracy result. The study said all chatbot outputs were graded as difficult to read, roughly at a college sophomore-to-senior level. For non-experts, a dense answer can feel authoritative even when it is wrong or incomplete.
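That readability claim is the kind of thing that can be checked mechanically. Below is a minimal sketch using the standard Flesch-Kincaid grade-level formula with a deliberately naive syllable counter; the study may well have used a different readability instrument, so treat this as an illustration of the idea rather than a reproduction of its method.

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count runs of vowels, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

# A made-up health answer in the dense register the study describes.
sample = (
    "Stem cell infusions are not an approved therapy for this condition. "
    "Evidence from randomized controlled trials remains limited, and adverse "
    "immunological events have been documented in unregulated clinics."
)
print(f"estimated grade level: {fk_grade(sample):.1f}")
```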
This also explains why the model-to-model gap was not the main story. Grok generated more highly problematic answers, but the larger pattern was that every system had trouble when prompts moved away from clean consensus areas. In other words, the problem is not a single bad chatbot. It is a product class that can sound more certain than the evidence behind it.
## What a safer workflow looks like
The practical response is not to treat AI health tools as useless. It is to treat them as a first pass, not the final layer. If the answer concerns diagnosis, dosage, treatment, supplements or conflicting symptoms, the next step is to verify the claim against a current primary source or official guidance. The more the answer depends on nuance, the less acceptable it is to rely on the chatbot alone.
Two checks help more than generic caution:
1. Confirm whether the cited source really exists and matches the claim (see the sketch after this list).
2. Check whether the topic is one where consensus is stable or still contested.
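The first of those checks can be partly automated. The sketch below is a hypothetical helper that takes the URLs a chatbot cites and confirms each one actually resolves; whether the page supports the claim, and whether the topic is settled at all, still has to be judged by a person.

```python
import urllib.error
import urllib.request

def url_resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the cited URL answers without an HTTP error.

    This only confirms the reference exists and is reachable; it says nothing
    about whether the page actually supports the chatbot's claim.
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "citation-check/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, ValueError, TimeoutError):
        return False

# Hypothetical reference list pulled out of a chatbot answer.
cited = [
    "https://bmjopen.bmj.com/content/16/4/e112695",
    "https://example.org/this-reference-may-have-been-invented",
]

for url in cited:
    label = "reachable" if url_resolves(url) else "verify manually"
    print(f"{label}: {url}")
```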
That distinction matters because the study itself showed different performance by topic. Cancer and vaccines did better because the evidence base is more structured. Nutrition and athletic performance did worse because the boundary between evidence, trend and opinion is easier for a model to blur. The model does not know that blur is dangerous; it just keeps generating plausible text.
The study is limited to five free-tier chatbots and an adversarial test design, so it should not be read as a universal failure rate for every real-world interaction. But it does show a clear boundary: when the answer needs source fidelity, health literacy and careful interpretation, the burden shifts back to the user or clinician. That is exactly where the weakest chatbot outputs are most likely to mislead.
### What would change the conclusion
- Better citation scoring, with complete and accurate references.
- Clearer refusal behavior when the model cannot verify a claim.
- Simpler, lower-reading-level explanations for non-experts.
- Public guidance that treats chatbot output as a draft, not a verdict.
The editorial takeaway is simple. AI health advice fails most dangerously when it sounds polished enough to skip verification. The next improvement is not just higher accuracy. It is making the source layer trustworthy enough that the answer can be checked before it is believed.
---
Author: [Alex Chen](https://x.com/AlexC0in) | Alex has followed blockchain technology since 2021, focusing on DeFi and on-chain data analysis
Source: [decrypt.co](https://decrypt.co/367689/half-ai-health-advice-wrong-seems-right)







