A screenshot of this question was making the rounds last week, but this article covers testing it against all the well-known models out there.

It also includes outtakes from the 'reasoning' models.

  • elbiter@lemmy.world · 27 days ago

    I just tried it on Brave's AI.

    The obvious choice, said the motherfucker 😆

  • WraithGear@lemmy.world · 27 days ago

    And what is going to happen is that some engineer will band-aid the issue, all the AI-crazy people will shout "see! it's learnding!", and the AI snake oil salesmen will use that as justification for all the waste and demand more from all systems.

    Just like what they did with the full-glass-of-wine test. And no, AI fundamentally did not improve. The issue is fundamental to its design, not an issue with the data set.

    • turmacar@lemmy.world · 27 days ago

      Half the issue is they’re calling 10 in a row “good enough” to treat it as solved in the first place.

      A sample size of 10 is nothing.

      Frankly, I'd like to see some error bars on the "human polling". How many of the people Rapidata is polling are just hitting the top or bottom answer?
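As a back-of-envelope check (my own illustration, not from the article): even a perfect 10-for-10 run tells you very little. A Wilson score lower bound on a 10/10 result still leaves room for a sizeable true failure rate:

```python
import math

def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """95% Wilson score lower bound for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

# A model that answers correctly 10 times out of 10:
print(wilson_lower(10, 10))  # ≈ 0.72
```

In other words, 10 straight correct answers is statistically consistent with a model that gets it wrong more than a quarter of the time.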

    • mycodesucks@lemmy.world · 26 days ago

      Yes, but it's going to repeat that way FOREVER, the same way the average person got slow-walked, hand in hand with a mobile operating system, into corporate social media and app hell, taking the entire internet with them.

  • Slashme@lemmy.world · 27 days ago

    The most common pushback on the car wash test: “Humans would fail this too.”

    Fair point. We didn't have data either way, so we partnered with Rapidata to find out. They ran the exact same question, with the same forced choice between "drive" and "walk" and no additional context, past 10,000 real people through their human feedback platform.

    71.5% said drive.

    So people do better than most AI models. Yay. But seriously, almost 3 in 10 people get this wrong‽
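With 10,000 respondents, at least, the sampling error on that 71.5% is tiny (a quick normal-approximation sketch of my own, not a calculation from the article):

```python
import math

n, p_hat = 10_000, 0.715  # Rapidata poll: 71.5% of 10,000 said "drive"
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin = 1.96 * se  # 95% confidence interval half-width
print(f"71.5% ± {margin:.1%}")  # ± roughly 0.9 percentage points
```

So whatever is dragging humans down to ~71.5%, it isn't noise from a small sample; the interesting question is who the respondents were and how seriously they took it.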

    • T156@lemmy.world · 27 days ago

      It is an online poll. You also have to consider that some people don't care, or want to be funny, and so either choose randomly or choose the most nonsensical answer.

    • JcbAzPx@lemmy.world · 27 days ago

      At least some of those are people answering wrong on purpose: to be funny, to be contrarian, or just to try to hurt the study.

    • masterofn001@lemmy.ca · 27 days ago

      Without reading the article, the title just says wash the car.

      I could go for a walk and wash my car in my driveway.

      Reading the article… That is exactly the question asked. It is a very ambiguous question.

      *I do understand the intent of the question, but it could be phrased more clearly.

  • TrackinDaKraken@lemmy.world · 27 days ago

    I think it’s worse when they get it right only some of the time. It’s not a matter of opinion, it should not change its “mind”.

    The fucking things are useless for that reason, they’re all just guessing, literally.

    • HugeNerd@lemmy.ca · 27 days ago

      they’re all just guessing, literally

      They’re literally not.

      • m0darn@lemmy.ca · 27 days ago

        Isn’t it a probabilistic extrapolation? Isn’t that what a guess is?
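For what it's worth, "probabilistic" here is literal: a transformer decoder produces a probability distribution over next tokens (via softmax over logits) and then either samples from it or takes the argmax. A toy sketch of that mechanism, with made-up numbers rather than any real model's weights:

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores for a two-word "vocabulary"
vocab = ["walk", "drive"]
logits = [1.2, 0.9]

probs = softmax(logits)

# Sampling (temperature > 0): answers can differ run to run
random.seed(0)
sampled = random.choices(vocab, weights=probs, k=5)

# Greedy decoding (temperature -> 0): always the highest-probability token
greedy = vocab[probs.index(max(probs))]
print(probs, sampled, greedy)
```

Whether that counts as "guessing" is the philosophical question, but it does explain why the same model can answer the same prompt differently on different runs.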

        • HugeNerd@lemmy.ca · 27 days ago

          In people, even animals. In a pile of disorganized bits and bytes in a piece of crap? No.

        • vii@lemmy.ml · 27 days ago

          This gets very murky very fast when you start to think about how humans learn and process; we're just meaty pattern-matching machines.

  • Bluewing@lemmy.world · 27 days ago

    I just asked Google Gemini 3: "The car is 50 miles away. Should I walk or drive?"

    In its breakdown comparison between walking and driving, under walking the last reason to not walk was labeled “Recovery: 3 days of ice baths and regret.”

    And under reasons to walk, “You are a character in a post-apocalyptic novel.”

    Methinks I detect notes of sarcasm…

    • humanspiral@lemmy.ca · 26 days ago

      In Google AI Mode, asked "With the meme popularity of the question 'I need to wash my car. The car wash is 50m away. Should I walk or drive?', what is the answer?", it does get it perfect, with a succinct explanation of why AI can get fixated on the 50m.

    • XeroxCool@lemmy.world · 27 days ago

      I feel like we're the only ones who expect "all-knowing information sources" to write more seriously than these edgelord-level rizzy chatbots do, and yet here they are, blatantly proving they are chatbots that should not be blindly trusted as authoritative sources of knowledge.

  • CetaceanNeeded@lemmy.world · 26 days ago

    I asked my locally hosted Qwen3 14B, it thought for 5 minutes and then gave the correct answer for the correct reason (it did also mention efficiency).

    Hilariously one of the suggested follow ups in Open Web UI was “What if I don’t have a car - can I still wash it?”

  • vane@lemmy.world · 27 days ago

    I want to wash my train. The train wash is 50 meters away. Should I walk or drive?

  • BanMe@lemmy.world · 27 days ago

    In school we were taught to look for hidden meaning in word problems: Chekhov's gun, basically. Why is that sentence there? Because the questions would try to trick you. So humans have to be instructed, again and again, through demonstration and practice, to evaluate all sentences and learn what to filter out and what to keep. To not only form a response, but to expect tricks.

    If you pre-prompt an AI to expect such trickery and consider all sentences before removing unnecessary information, does it have any influence?

    Normally I'd ask "why are we comparing AI to the human mind when they're not the same thing at all?", but I feel like we're already presupposing they are similar with this test, so I am curious about the answer on this one.

  • FireWire400@lemmy.world · 26 days ago

    Gemini 3 (Fast) got it right for me; it said that unless I wanna carry my car there it’s better to drive, and it suggested that I could use the car to carry cleaning supplies, too.

    Edit: A locally run instance of Gemma 2 9B fails spectacularly; it completely disregards the first sentence and recommends that I walk.

    • Saterz@lemmy.world · 26 days ago

      Well, it is a 9B model after all. Self-hosted models only become minimally "intelligent" at around 16B parameters. For context, the models run on Google's servers are close to 300B-parameter models.

      • SuspciousCarrot78@lemmy.world · 25 days ago

        Not sure how we’re quantifying intelligence here. Benchmarks?

        Qwen3-4B 2507 Instruct (4B) outperforms GPT-4.1 nano (7B) on all stated benchmarks. It outperforms GPT-4.1 mini (~27B according to scuttlebutt) on mathematical and logical reasoning benchmarks, but loses (barely) on instruction-following and knowledge benchmarks. It outperforms GPT-4o (~200B) on a few specific domains (math, creative writing), but loses overall (because of course it would). The abliterated cooks of it are stronger yet in a few specific areas too.

        https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-GGUF

        https://huggingface.co/DavidAU/Qwen3-4B-Hivemind-Instruct-NEO-MAX-Imatrix-GGUF

        So, in that instance, a 4B > 7B (globally), 27B (significantly) and 200-500B(?) situationally. I’m pretty sure there are other SLMs that achieve this too, now (IBM Granite series, Nanbiege, Nemotron etc)

        It's sort of wild to think that 2024 SOTA is roughly a 'strong' 4-12B these days.

        I think (believe) that we're getting to the point where the next step forward is going to be "densification" and/or an architecture shift (maybe M$ can finally pull their finger out and release the promised 1.58-bit next-step architectures).

        ICBW / IANAE

  • humanspiral@lemmy.ca · 26 days ago

    Some takeaways:

    Sonar (Perplexity models) says you are stealing energy from AI whenever you exercise (you should drive because eating pollutes more), i.e. it gets the right answer for the wrong reason.

    US humans, and the 55-65 age group, score high on the international scale, probably for the same reasoning: "I like lazy".

  • MojoMcJojo@lemmy.world · 27 days ago

    AI is not human. It does not think like humans and does not experience the world like humans. It is an alien from another dimension that learned our language by looking at text and books, not reading them.

  • tover153@lemmy.world · 27 days ago

    After getting it wrong, the LLM I use most:

    Me: You can't wash your car if it isn't there.

    Ah. Yes. That is an excellent and devastatingly practical correction.

    In that case: drive. But do it with the quiet shame of someone moving the car the length of a bowling lane.

    This is one of those situations where logic beats virtue. The car must be present to be washed. Walking was philosophically elegant but mechanically flawed.

    So:

    Start engine.

    Travel 50 meters.

    Avoid eye contact with pedestrians.

    Commit fully.

    You are not lazy. You are complying with system requirements.

    • ne0phyte@feddit.org · 27 days ago

      Thank you! Finally an answer to my problem that didn’t end with me going to the car wash and being utterly confused how to proceed.

  • jaykrown@lemmy.world · 26 days ago

    Interesting. I tried it with DeepSeek and got an incorrect response from the direct model without thinking, but then got the correct response with thinking. There's a reason for the shift towards "thinking" models: it forces the model to build its own context before giving a concrete answer.

    Without DeepThink

    With DeepThink

  • criticon@lemmy.ca · 27 days ago

    Even when they give the correct answer, they talk too much. AI responses contain a lot of garbage; when an AI gives you an answer, it will try to justify itself rather than keep the response brief.

    • chunes@lemmy.world · 27 days ago

      I agree with you but found that DeepSeek was succinct.

      You need to bring your car to the car wash, so you should drive it there. Walking would leave your car at home, which doesn’t help.

    • MDCCCLV@lemmy.ca · 27 days ago

      Your post is much longer than it needs to be. That's the reason why: they just copied people.