Nathan Helmly, NSWCDD Computer Scientist, builds an algorithm with his team at the NSWCDD Innovation Lab’s wargaming hackathon on Aug. 11, 2021. (Caleb Gardner / Naval Surface Warfare Center Dahlgren Division)

BALTIMORE — The Pentagon’s IT arm, the Defense Information Systems Agency, has seen quite enough AI hype, thank you very much. So at the agency’s annual Forecast to Industry here on Monday, new DISA director Lt. Gen. Paul Stanton said he needs clarity from vendors on how their offerings’ AI features actually work and what data they were trained on.

“If vendors are incorporating artificial intelligence … they need to be able to explain the details of how and why they’re using it,” Stanton told reporters at a roundtable during the industry day. “AI requires context. You don’t just sprinkle AI on a problem.”

“Don’t do it for the joy of doing AI … just to say that you have it,” added the agency’s chief technology officer, Steve Wallace. “We’ve seen a lot of vendors who claim to have it, and when you peel back the onion, there’s not a whole lot of depth there.”

In particular, DISA is not interested in ChatGPT clones. The Air Force has already developed a chatbot for Defense Department use, NIPRGPT, Wallace said, so there’s no need to duplicate that capability: “We don’t need 60 GPTs running around the Department.”

Instead, Stanton, Wallace, and other agency leaders emphasized, DISA needs AI built to serve specific Defense Department missions. They also need to understand — and have contractual rights to — the underlying data on which that AI is trained.

“Being very, very careful and thoughtful about how we acquire, leverage and analytically resolve the right data is important,” Stanton said. “That’s at the heart of our business as an agency.”

One way that focus on the underlying data manifests, said Douglas Packard, DISA’s director of contracting and procurement, is that the agency’s rights to data are absolutely going to come up in contract negotiations.

“Don’t create AI without thinking I’m going to need the data rights to it,” he said.

In many negotiations, “people spend weeks on data rights,” Packard went on. “Our biggest negotiation item [is] always the data rights, because we [may] need to give it to someone else” to enable competition between contractors and avoid “vendor lock.”

DISA also wants to understand the data used by AI tools so it can assure itself that the foundation is sound. Stanton and his team want AI trained on relevant, authoritative, and carefully curated data — not random Reddit posts or other dubious datasets built from indiscriminate trawls of the public internet.

“We don’t want our data derived from the entire corpus of the internet; there’s a lot of bad data,” Stanton said.

AI trained on inaccurate data — or even on accurate data mixed in with irrelevant information which the algorithm doesn’t know to ignore — can produce wildly erroneous answers. For example, Wallace said he asked a large language model how many directors DISA has had. “The answer I got was something like 1,700.” Since the agency was founded in 1991, just 33 years ago, that AI result would imply DISA got a new director almost every week.

RELATED: How new White House AI memo impacts, and restricts, the Pentagon

LLMs can also make subtler errors that are harder to detect, Wallace went on, as DISA discovered in recent internal testing of an “AI concierge” intended to help its staff sort through what policies and procedures they need to follow. “We learned a lot,” Wallace said.

For instance, when the agency issues a new policy that supersedes an older one, a human can simply look at the dates each document was issued to figure out which one is the latest guidance. But, it turned out, the LLM could not. As a result, the prototype AI concierge would give contradictory answers, sometimes applying the current policy and sometimes the obsolete one, with which one it used often depending on exactly how the user phrased the question.
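One way to sidestep that failure mode — and this is a minimal sketch under assumed details, not a description of DISA's actual system — is to never let the model see superseded documents at all: tag each policy with its series and issue date, and filter the retrieval corpus down to the latest version of each series before anything is indexed for the assistant. The policy series, dates, and text below are purely illustrative.

```python
# Hedged sketch: keeping an LLM assistant from citing superseded policy.
# Instead of asking the model to reason about issue dates, filter the corpus
# so only the most recent document in each policy series is ever indexed.
# Series names, dates, and text are made up for illustration.

from datetime import date

POLICY_CORPUS = [
    {"series": "records-retention", "issued": date(2019, 3, 1),
     "text": "Retain program records for five years."},
    {"series": "records-retention", "issued": date(2023, 10, 15),
     "text": "Retain program records for seven years."},
    {"series": "telework", "issued": date(2021, 6, 1),
     "text": "Telework requires supervisor approval."},
]

def current_policies(corpus: list[dict]) -> list[dict]:
    """Keep only the most recently issued document in each policy series."""
    latest: dict[str, dict] = {}
    for doc in corpus:
        prior = latest.get(doc["series"])
        if prior is None or doc["issued"] > prior["issued"]:
            latest[doc["series"]] = doc
    return list(latest.values())

# Only these documents would be indexed for the assistant, so a question about
# records retention can no longer surface the obsolete five-year rule.
for doc in current_policies(POLICY_CORPUS):
    print(doc["series"], doc["issued"], "-", doc["text"])
```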

To protect against such AI errors and hallucinations, Wallace said, one essential safeguard is a well-trained workforce that knows better than to accept the algorithm’s output without question. With today’s widespread naiveté about AI, he said, “it’s not terribly different than when GPS came on the scene” and, out of blind trust, users would navigate to nowhere — or drive into a lake, he added, citing an episode of The Office. (Such navigation errors have happened in real life.)

But industry also needs to build better, more reliable algorithms that don’t hallucinate as often in the first place, Wallace and the other officials said. Large Language Models, in particular, need to generate their answers based on verified official sources, rather than the internet at large (using a technique known as Retrieval Augmented Generation, aka RAG). LLMs also need to provide links to specific sources that users can double-check, the officials said, rather than just output plausible but unsourced paragraphs, the way publicly available chatbots do.
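To make the RAG-with-citations idea concrete, here is a minimal sketch of the pattern the officials describe: retrieve passages from a curated document store, then constrain the model to answer only from those passages and cite their IDs. The document store, the keyword-overlap scoring, and the prompt format are all simplifying assumptions for illustration, not any agency's or vendor's actual implementation.

```python
# Minimal sketch of retrieval-augmented generation (RAG) with source citations.
# The curated sources, naive scoring, and prompt wording are illustrative
# assumptions only.

CURATED_SOURCES = [
    {"id": "POLICY-2024-01", "title": "Cloud Authorization Procedures",
     "text": "All cloud services must receive a provisional authorization before use."},
    {"id": "POLICY-2023-07", "title": "Network Access Policy",
     "text": "Remote access requires multi-factor authentication on approved devices."},
]

def retrieve(question: str, k: int = 2) -> list[dict]:
    """Rank curated documents by naive keyword overlap with the question."""
    terms = set(question.lower().split())
    scored = sorted(
        CURATED_SOURCES,
        key=lambda doc: len(terms & set(doc["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, docs: list[dict]) -> str:
    """Constrain the model to answer only from the retrieved, citable sources."""
    context = "\n".join(f"[{d['id']}] {d['title']}: {d['text']}" for d in docs)
    return (
        "Answer using ONLY the sources below and cite their IDs in brackets.\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    question = "Is multi-factor authentication required for remote access?"
    prompt = build_prompt(question, retrieve(question))
    print(prompt)
    # The prompt would then go to an LLM endpoint; the returned answer should
    # carry bracketed source IDs a user can trace back to the original documents.
```

The design choice the sketch illustrates is the one the officials ask for: answers grounded in a small, verified corpus with explicit pointers back to the source, rather than unsourced paragraphs generated from the open internet.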

“There’s some ongoing research that is really intriguing” on such “traceable” answers, Stanton said. Forcing the AI to footnote its work this way may require more processing power and data storage, he said. But properly curating datasets could also reduce costs, he said, by allowing AI to ignore vast amounts of junk data and just work with the data that actually matters.

“Help us understand what data is relevant … so we leverage the right data,” Stanton said. “Or else we’re going to be stuck in data centers we can’t afford, cranking away on datasets that have no relevancy to the problem we’re trying to solve.”