DoD Chief Information Officer John Sherman, Dr. Craig Martell, DoD chief digital and artificial intelligence officer, and Air Force Lt. Gen. Robert J. Skinner, director of the Defense Information Systems Agency, testify before a House Armed Services Subcommittee in Washington, D.C., March 22, 2024. (DoD photo by EJ Hersom)

WASHINGTON — To get a gimlet-eyed assessment of the actual capabilities of much-hyped generative artificial intelligence tools like ChatGPT, officials from the Pentagon’s Chief Digital and Artificial Intelligence Office (CDAO) said they will publish a “maturity model” in June.

“We’ve been working really hard to figure out where and when generative AI can be useful and where and when it’s gonna be dangerous,” the outgoing chief digital and artificial intelligence officer, Craig Martell, told the Cyber, Innovative Technologies, & Information Systems subcommittee of the House Armed Services Committee this morning. “We have a gap between the science and the marketing, and one of the things our organization is doing, [through its] Task Force Lima, is trying to rationalize that gap. We’re building what we’re calling a maturity model, very similar to the autonomous driving maturity model.”

That widely used framework rates the claims of car-makers on a scale from zero — a purely manual vehicle, like a Ford Model T — to five, a truly self-driving vehicle that needs no human intervention in any circumstances, a criterion that no real product has yet met.
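For reference, the zero-to-five scale Martell invokes is commonly associated with the SAE J3016 levels of driving automation. Below is a minimal sketch of that ladder, purely as an illustration; the level names paraphrase the commonly cited SAE definitions, not any DoD material.

```python
from enum import IntEnum

class DrivingAutomationLevel(IntEnum):
    """Illustrative SAE J3016-style levels of driving automation."""
    NO_AUTOMATION = 0           # purely manual vehicle, like a Ford Model T
    DRIVER_ASSISTANCE = 1       # one assist feature, e.g., adaptive cruise control
    PARTIAL_AUTOMATION = 2      # steering and speed assist; driver supervises constantly
    CONDITIONAL_AUTOMATION = 3  # drives itself in limited conditions; driver must take over on request
    HIGH_AUTOMATION = 4         # drives itself within a defined operating domain
    FULL_AUTOMATION = 5         # no human intervention in any circumstances; no product has met this

# Martell's assessment, restated in these terms: marketing may claim level 5,
# but generative AI today sits around level 3, with some level 4 work.
print(DrivingAutomationLevel(3).name)  # CONDITIONAL_AUTOMATION
```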

RELATED: Artificial Stupidity: Fumbling The Handoff From AI To Human Control

For generative AI, Martell continued, “that’s a really useful model because people have claimed level five, but objectively speaking, we’re really at level three, with a couple folks doing some level four stuff.”

The problem with large language models (LLMs) to date is that they produce plausible, even authoritative-sounding text that is nevertheless riddled with errors, called “hallucinations,” that only an expert in the subject matter can detect. That makes LLMs deceptively easy to use but terribly hard to use well.

“It’s extremely difficult. It takes a very high cognitive load to validate the output,” Martell said. “[Using AI] to replace experts and allow novices to replace experts — that’s where I think it’s dangerous. Where I think it’s going to be most effective is helping experts be better experts, or helping someone who knows their job well be better at the job that they know well.”

“I don’t know, Dr. Martell,” replied a skeptical Rep. Matt Gaetz, one of the GOP members of the subcommittee. “I find a lot of novices showing capability as experts when they’re able to access these language models.”

“If I can, sir,” Martell interjected anxiously, “it is extremely difficult to validate the output. … I’m totally on board, as long as there’s a way to easily check the output of the model, because hallucination hasn’t gone away yet. There’s lots of hope that hallucination will go away. There’s some research that says it won’t ever go away. That’s an empirical open question I think we need to really continue to pay attention to.

“If it’s difficult to validate output, then… I’m very uncomfortable with this,” Martell said.

Both Hands On The Wheel: Inside The Maturity Model

The day before Martell testified on the Hill, his chief technology officer, Bill Streilein, detailed the development and timeline of the forthcoming maturity model at the Potomac Officers Club’s annual conference on AI.

Since the CDAO’s Task Force Lima launched last August, Streilein said, it’s been assessing over 200 potential “use cases” for generative AI submitted by organizations across the Defense Department. What they’re finding, he said, is that “the most promising use cases are those in the back office, where a lot of forms need to be filled out, a lot of documents need to be summarized.”

RELATED: Beyond ChatGPT: Experts say generative AI should write — but not execute — battle plans

“Another really important use case is the analyst,” he continued, because intelligence analysts are already experts in assessing incomplete and unreliable information, with doublechecking and verification built into their standard procedures.

As part of that process, CDAO went to industry to ask for help in assessing generative AIs — something the private sector also has a big incentive to get right. “We released an RFI [Request For Information] in the fall and received over 35 proposals from industry on ways to instantiate this maturity model,” Streilein told the Potomac Officers conference. “As part of our symposium, which happened in February, we had a full day working session to discuss this maturity model.

“We will be releasing our first version, version 1.0 of the maturity model… at the end of June,” he continued. But it won’t end there: “We do anticipate iteration… It’s version 1.0 and we expect it will keep moving as the technology improves and also the Department becomes more familiar with generative AI.”

Streilein said 1.0 “will consist of a simple rubric of five levels that articulate how much the LLM autonomously takes care of accuracy and completeness,” previewing the framework Martell discussed with lawmakers. “It will consist of datasets against which the models can be compared, and it will consist of a process by which someone can leverage a model of a certain maturity level and bring it into their workflow.”
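CDAO had not published the rubric at the time of writing, so any concrete schema is speculative. Purely as a hypothetical illustration of the three pieces Streilein describes, a five-level rubric score, benchmark datasets, and a workflow-adoption process, an assessment record might be structured along these lines; every field name here is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MaturityAssessment:
    """Hypothetical record for one LLM scored against the forthcoming CDAO rubric.

    All field names are invented for illustration; version 1.0 of the actual
    rubric had not been released at the time of writing.
    """
    model_name: str
    maturity_level: int                 # 0-5: how much accuracy/completeness the LLM handles on its own
    benchmark_datasets: list[str] = field(default_factory=list)  # datasets the model was compared against
    workflow_guidance: str = ""         # process notes for bringing a model of this level into a workflow

example = MaturityAssessment(
    model_name="hypothetical-llm",
    maturity_level=3,
    benchmark_datasets=["back-office document summarization (hypothetical)"],
    workflow_guidance="Outputs must be verified by a subject-matter expert before use.",
)
print(example.maturity_level)  # 3
```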

RELATED: 3 ways intel analysts are using artificial intelligence right now, according to an ex-official

Why is CDAO taking inspiration from the maturity model for so-called self-driving cars? To emphasize that the human can’t take a hands-off, faith-based approach to this technology.

“As a human who knows how to drive a car, if you know that the car is going to keep you in your lane or avoid obstacles, you’re still responsible for the other aspects of driving, [like] leaving the highway to go to another road,” Streilein said. “That’s sort of the inspiration for what we want in the LLM maturity model… to show people the LLM is not an oracle, its answers always have to be verified.”

Streilein said he is excited about generative AI and its potential, but he wants users to proceed carefully, with full awareness of the limits of LLMs.

“I think they’re amazing. I also think they’re dangerous, because they provide the very human-like interface to AI,” he said. “Not everyone has that understanding that they’re really just an algorithm predicting words based on context.”
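Streilein’s point that an LLM is “really just an algorithm predicting words based on context” can be made concrete with a toy sketch. The snippet below builds a tiny next-word predictor from raw word counts; real LLMs use large neural networks over subword tokens, so this is only a minimal illustration of the underlying idea that the model continues text with whatever looks most likely, whether or not it is true.

```python
from collections import Counter, defaultdict

# Toy next-word prediction: count which word follows which in a small corpus,
# then pick the most likely continuation for a given context word.
corpus = "the model predicts the next word given the context the model saw".split()

next_word_counts: dict[str, Counter] = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(context_word: str) -> str:
    """Return the word most frequently observed after context_word."""
    candidates = next_word_counts.get(context_word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))    # 'model' -- fluent-sounding, not guaranteed correct
print(predict_next("orbit"))  # '<unknown>' -- no context seen, nothing to predict from
```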