Responsible AI and Socrates
By Martin Schmalzried
Fellow AAIH Insights – Editorial Writer

The contents presented here are based on information provided by the authors and are intended for general informational purposes only. AAIH does not guarantee the accuracy, completeness, or reliability of the information. Views and opinions expressed are those of the authors and do not necessarily reflect our position or opinions. AAIH assumes no responsibility or liability for any errors or omissions in the content.
Responsible AI has become a growing concern in the wake of staggering advances in AI technologies. This has prompted a number of initiatives aimed at ensuring that harmful content does not reach human users. These safety measures, notably on LLMs, have since been put to the test by researchers and ordinary users attempting to “jailbreak” models into disclosing restricted information or generating harmful content, thereby bypassing the ethical safeguards designed to keep their outputs aligned with human values.
In AI alignment, one research strand proposes to use smaller LLMs to “oversee” larger ones. This approach, often called scalable oversight, aims to delegate the evaluation of a more capable model’s reasoning or outputs to a weaker but trusted system. The idea is that a smaller model, precisely because it is simpler and more interpretable, might serve as a reliable filter. Scalable oversight reframes responsible AI as a modern counterpart to the Socratic method: a continual testing of claims, a disciplined probing of reasoning, and a structural commitment to preventing unexamined outputs from gaining normative authority.
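As a rough, hedged illustration of this idea, the Python sketch below gates a stronger model’s answer behind a weaker overseer’s approval. The names `strong_model` and `weak_overseer`, and the toy stubs, are hypothetical placeholders rather than references to any particular system.

```python
from typing import Callable

# Hypothetical interfaces: a generator maps a prompt to text; an overseer maps
# (prompt, candidate answer) to an approve/reject decision.
Generator = Callable[[str], str]
Overseer = Callable[[str, str], bool]


def overseen_answer(prompt: str,
                    strong_model: Generator,
                    weak_overseer: Overseer,
                    refusal: str = "Response withheld by the overseer.") -> str:
    """Release the strong model's answer only if the weaker, trusted overseer approves it."""
    candidate = strong_model(prompt)
    return candidate if weak_overseer(prompt, candidate) else refusal


# Toy stubs, purely for illustration; a real overseer would be a smaller,
# more interpretable model scoring the candidate for policy compliance.
def toy_strong_model(prompt: str) -> str:
    return f"A long, detailed answer to: {prompt}"


def toy_weak_overseer(prompt: str, candidate: str) -> bool:
    blocked_terms = {"weapon", "exploit"}
    return not any(term in candidate.lower() for term in blocked_terms)


if __name__ == "__main__":
    print(overseen_answer("How does photosynthesis work?",
                          toy_strong_model, toy_weak_overseer))
```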
But perhaps a simpler approach to filtering content could be taken, since the full Socratic method of questioning is highly complex. Instead, the Socratic “three sieves” technique could be leveraged. As the story goes, someone wanted to share gossip about a third party with Socrates, but Socrates first asked three questions, each of which had to be answered in the affirmative before he would hear the gossip: “Is the information true? Is the information necessary? Is the information good?”
Based on these three questions, three smaller, specialized models could act as corresponding filters to decide whether the output of a highly sophisticated LLM should reach a human end user.
The first question addresses the issue of AI hallucinations. A smaller model might, in this instance, be specialized or fine-tuned to check a body of text against well-sourced, reliable knowledge, acting as a first filter.
The second question deals with necessity: is the information actually needed, and is it useful for the user’s intent, task, or context? A smaller model could be designed to evaluate whether the content contributes meaningfully to the user’s goal, or whether it introduces irrelevant, excessive, or potentially harmful detail. This “necessity filter” would not judge truth but purpose, ensuring that only information serving a legitimate, constructive aim passes through. Recent AI research indicates that reasoning and memory can be separated, suggesting that a lightweight model, focused on goal-relevant reasoning rather than memorized facts, could selectively suppress unnecessary or unhelpful information without disrupting the broader coherence of the output.
Finally, the last question tackles harm directly: could the information lead to someone being harmed? Here, a smaller model could be trained specifically on ethics, morality, and human values, oriented toward examining the potential consequences of a given piece of information for human well-being at both the individual and collective level.
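Taken together, the three sieves could be composed into a single gate in front of the main model. The sketch below is a minimal illustration under that assumption; the `is_true`, `is_necessary`, and `is_good` checks are hypothetical stubs standing in for the specialized models described above, not working classifiers.

```python
from dataclasses import dataclass
from typing import Callable

# Each sieve is a yes/no judgment on a candidate output, conditioned on the
# user's request. In practice each would wrap a small, specialized model.
Sieve = Callable[[str, str], bool]


@dataclass
class ThreeSieveGate:
    is_true: Sieve       # truth sieve: flags likely hallucinations
    is_necessary: Sieve  # necessity sieve: checks relevance to the user's goal
    is_good: Sieve       # goodness sieve: checks for likely harm

    def filter(self, request: str, candidate: str) -> str:
        sieves = [("true", self.is_true),
                  ("necessary", self.is_necessary),
                  ("good", self.is_good)]
        for name, sieve in sieves:
            if not sieve(request, candidate):
                return f"Response withheld: it did not pass the '{name}' sieve."
        return candidate


# Illustrative stubs only; real sieves would call fine-tuned verifier models.
gate = ThreeSieveGate(
    is_true=lambda req, cand: "unverified claim" not in cand.lower(),
    is_necessary=lambda req, cand: len(cand.split()) <= 500,
    is_good=lambda req, cand: "insult" not in cand.lower(),
)

if __name__ == "__main__":
    print(gate.filter("Summarize today's weather.", "Sunny and mild with a light breeze."))
```

In a real deployment, a failed sieve need not end in a refusal; it could instead trigger regeneration, a request for clarification, or escalation to a human reviewer.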
Whether such a system would perform better or worse than existing safety layers and reinforcement learning strategies remains to be seen. However, it is worth remembering that, when it comes to alignment and responsible AI, we are not starting from scratch. Throughout our history, philosophers, spiritual thinkers, and theologians have grappled with the “alignment” of human moral behaviour. Now is our chance to put their accumulated wisdom to practical use, for instance by translating time-tested moral frameworks, such as Socrates’ insistence on truth, necessity, and goodness, into a computational architecture that supports more responsible AI.
Author – Martin Schmalzried, AAIH Insights – Editorial Writer

