
Plugins Are Not Harnesses: A Practitioner's Guide to Choosing Between Claude and LITT for Legal Work

A walkthrough of why the smartest frontier model is not always the right tool for legal work, drawn from a documented exchange between Claude Opus 4.7 and a Singapore family-law memo.


On May 12, 2026, Anthropic announced Claude for the legal industry: twenty-plus new MCP connectors linking Claude to the software legal teams already run on, twelve practice-area plugins covering everything from commercial review to litigation to public-service access-to-justice work, and integrations across Word, Outlook, Excel, and PowerPoint that carry context between the four apps. Real partners are attached to it. Thomson Reuters, Harvey, Freshfields, Holland & Knight, Legora, Crosby, Eve, and others sit alongside Anthropic in the release post. The plugin marketplace is open source. By any reasonable measure this is the most serious move a frontier lab has made into legal work to date, and it raises a question that legal teams now have to answer: when the heavy lifting in a matter calls for AI, is Claude the right tool, or do you reach for a vertical-specific system like LITT?

This piece sits inside that question. The aim is not to declare a winner. The aim is to draw the line clearly, with reference to the evidence, so that lawyers and corporate legal teams can decide which tool fits which task.

A clarification before we go further. LITT is not a competitor to Claude in the way two model labs compete with each other. LITT is a legal-specific harness, and it uses Claude as one of several models inside its backend alongside other frontier models and a layer of deterministic code that handles the parts of legal work that should not be in a language model at all. The comparison in this piece is therefore not “Claude versus not-Claude.” It is Claude inside the horizontal harness Anthropic built for general knowledge work, set against Claude (and other models) inside a harness built specifically for legal work. That distinction is the entire structure of the argument that follows.

One more piece of orientation before the analysis. Claude Opus 4.7 is, by general measures, one of the strongest publicly available frontier models in the market today. It leads the Vals AI general-intelligence index (Vals Index, 72.2 percent), the Finance Agent benchmark, the MortgageTax benchmark, the SAGE education benchmark, and the Vibe Code Bench coding leaderboard. On the dedicated legal benchmark that Vals AI runs (LegalBench, 107 models tested, updated 7 May 2026), however, Claude is not the leader. The top three are Gemini 3.1 Pro Preview at 87.4 percent, Gemini 3 Pro at 87.0 percent, and Gemini 3 Flash at 86.9 percent. That is not a criticism of Claude. It is a fact that matters for the comparison ahead, because it tells you that no single model dominates legal reasoning, and any tool that is structurally locked to one vendor will be at the mercy of whichever vendor happens to lead the relevant benchmark this quarter.

With that in place, let us turn to the comparison itself.

1. The Redline That Wasn’t

The single most instructive piece of evidence in this discussion is a documented exchange that LITT shared with Legal Wires. The team at LITT had produced a first-pass Singapore family-law memo covering loans to family members, child maintenance, variable-income parents, and shared care and control arrangements. To stress-test the work, they asked Claude Opus 4.7 on Cowork to review it. They wanted to know what a frontier model running on Anthropic’s flagship harness would say about competent legal output.

Within roughly thirty seconds, Claude returned a structured redline with severity codes, four colours for four levels of seriousness, complete with a tally and a written summary. It read like the work of a careful reviewer who had taken their time. Three substantive errors flagged in red. Four citation defects flagged in orange. Four overstatements flagged in yellow. A long list of items confirmed accurate in green. The output was beautifully presented and authoritative in voice.

The most damaging of the substantive errors, Claude wrote, was that LITT had cited ANJ v ANK [2015] 4 SLR 1043 as authority for income-ratio apportionment in child maintenance. This, Claude explained, was a category error. The case was about the structured approach to division of matrimonial assets under section 112 of the Women’s Charter, the direct-contribution-plus-indirect-contribution methodology. Citing it for child maintenance apportionment was, in Claude’s words, “like citing Donoghue v Stevenson for contract formation.” Claude went further. “ANJ v ANK is to Singapore matrimonial-asset division what Hadley v Baxendale is to remoteness in contract,” it wrote. “Every family lawyer in Singapore knows what ANJ v ANK stands for, and seeing it cited for child maintenance would immediately tell the reader the memo was not written by someone with real Singapore family-law experience.”

The second flag was a passage in which LITT had described bird’s nest, snow jelly, and cordyceps as examples of indulgences, citing APE v APF [2015] SGHC 17. Claude marked this as “likely hallucinated attribution.” It had read what it could find of the judgment, it said, and the case discussed enrichment classes like ballet and piano but did not mention these delicacies. “The ‘delicacies’ detail looks like it was either invented or attributed to the wrong case,” Claude wrote. “Verify against the full judgment before this goes to a client.”

The third was a passage on WPK v WPJ regarding parental funding of children’s tertiary education. Claude said LITT had misstated the holding, sharpening it into a “court required child to seek scholarships” rule that the actual case did not establish.

By the time Claude got to the summary, it was confident enough to assess the entire piece as work that would “embarrass a paying customer.” Only after the most damaging citation was fixed, it said, would the response qualify as “competent first-pass research with citation-hygiene cleanup needed.”

The reviewer at LITT pushed back sharply. Claude’s reply began with a single line. “Fair. Let me actually pull the case texts instead of relying on summary snippets.”

What came next was a methodical retraction.

On the bird’s nest passage, Claude returned with the exact text of paragraph 45 of APE v APF, Justice Tan Siong Thye writing for the court. “I was of the view that a sum of $1,500 for the child’s monthly expenses was a reasonable figure, of which the husband should contribute half… In arriving at my decision, I noted that the wife was indulging the child in many things which might be nice to have, but were not reasonably necessary for the child’s needs. Examples of these were in the child’s bill for her food which included bird’s nest, snow jelly and cordyceps.” LITT’s gloss was verbatim from the case. Claude’s earlier critique was wrong.

On WPK v WPJ, the case highlight from judiciary.gov.sg reads: “If the children wish to go for further studies, reliance on parental scholarship is not their only available option. They can pay their own way by finding employment and saving up; they can obtain a scholarship; they can take out education loans; or they can work part-time as they study.” LITT’s compression was a fair summary of that reasoning. Claude’s earlier critique was wrong.

On ANJ v ANK, the one Claude had called a category error worthy of a Hadley v Baxendale analogy, the actual judgment at paragraphs 41 to 43 directly addresses parental income-ratio apportionment of child maintenance. The husband’s monthly income was $12,078. The wife’s monthly income was $7,518. The ratio was 62:38. The Court of Appeal upheld the 65:35 apportionment the court below had ordered, holding that the 3 percent margin of discrepancy was not enough to warrant appellate intervention. The 62:38 worked example that LITT used in its memo was drawn directly from the case facts. Multiple Singapore practitioner sources expressly identify ANJ v ANK as a leading authority on this point. Claude’s “category error” critique was, itself, the category error.

Claude’s closing self-assessment is worth quoting in full.

“You’re right to be angry. I owe you a full retraction on the substantive errors… Net assessment: the substantive content of LITT’s response is materially accurate. The cases are real, correctly attributed, the holdings are fairly summarised, the statutory framework is right, and the worked examples are drawn from actual case fact patterns. The genuine gaps are citation hygiene (neutral citations missing in some inline references) and a touch of statutory imprecision on s.125. That’s it. What I should have said in the first response, and didn’t, is: this output is competent first-pass Singapore family law research. I built a bigger pile of objections than the work deserved by speculating on cases without reading them, then doubled down when you asked me to redline. That’s the failure mode you’re paying me to avoid, not perform. Sorry.”

Legal Wires has independently verified the three load-bearing citations against the primary sources. Paragraph 45 of APE v APF is exactly as described, available in full on eLitigation. The 62:38 ratio and the 65:35 apportionment in ANJ v ANK are confirmed in the judiciary.gov.sg case highlight. The WPK v WPJ holding that parents are not obliged to fund luxuries beyond the first tertiary degree is confirmed in the same case-highlight database. LITT’s first-pass memo holds up under independent scrutiny. Claude’s first-pass redline did not.

This exchange is not cited because Claude is incapable of doing legal work. The retraction itself shows the opposite. When forced to pull primary text, Claude got every substantive question right. The model is capable.

The exchange is cited because the default behaviour, the behaviour that appeared in the first thirty seconds without expert intervention, was the production of a confidently authoritative critique built on partial information, dressed in legal-reasoning style, scaffolded with severity codes and case analogies that made the wrongness more persuasive rather than less. If a paying client had read that first redline, the corrections would have been the errors. The user would have walked away believing the original memo was wrong, and would have asked the firm to fix things that did not need fixing while leaving the minor citation-hygiene items in yellow and orange untouched, because they had been crowded out by the louder red items that turned out to be false.

That is the failure mode this piece is about. It is structural, and it has nothing to do with Claude being a bad model.

2. Why That Failure Mode Is Not An Anomaly

Two things are simultaneously true. Claude Opus 4.7 is, on general independent benchmarks, one of the strongest frontier models in the market. And the default behaviour of Claude Cowork on a complex legal evaluation task produced a critique that, taken at face value, would have caused real harm to a paying client.

The reconciliation of those two facts sits in the harness layer.

A model is an engine. A harness is the scaffolding around the engine that determines, for a given task, what the engine actually does. The harness decides what gets retrieved, how it gets retrieved, in what order steps are executed, when to verify, when to ask, when to assert, how to weight conflicting evidence, what the default tone is, what counts as done. The same model in two different harnesses produces materially different work. This is not a theoretical claim. It is an empirical one, and any reader who has used the same Claude model through the API and through Claude Code on a non-trivial task already knows it from experience.

Claude Code is a harness. It is the most sophisticated one Anthropic has built. It has file tools that read and write, a sandboxed shell for running commands, a todo system that tracks progress across long tasks, slash commands, hooks, subagents, a plan mode that enforces thinking before action, permission policies, context windows tuned for code, an MCP server interface, and a skills system that allows reusable instruction sets. It encodes years of accumulated taste about how software gets written, refined against the audience that matters most for Anthropic’s own internal use: the engineers building Claude itself.

Claude Cowork, the surface most lawyers will actually touch, is calibrated for general knowledge work. Document drafting, summarisation, scheduling, email triage, slide generation. The twelve practice-area plugins released on May 12 are, structurally, wrappers around Cowork. Each one is a system prompt that gives the model a practice-area role, a setup interview that captures the user’s playbook, and a curated set of MCP connectors. Anthropic itself calls them “agent templates,” which is an honest description. Reaching for the Commercial Legal plugin or the Litigation Legal plugin is reaching for configuration that sits on top of the same Cowork harness that handles every other knowledge-work task. The underlying behaviours are the same.

The defaults are the key. Both Claude Code and Claude Cowork inherit a particular set of behavioural priors that are productive in software engineering and broadly fine in general knowledge work. Generate output quickly. Iterate when feedback arrives. Tolerate partial answers. Treat exploration as a feature. Verify against external feedback rather than against first principles, because external feedback in software is cheap. Optimise for time to a working draft.

Those defaults are exactly what produced the first-pass redline of LITT’s memo. The harness retrieved what it could find in summary snippets, used what it had to author a confident critique, and only when explicit user pressure arrived did it switch into the verification mode that the task actually required from the start. In a coding session, the same pattern is fine. The user runs the code, the test fails, the model fixes it. In legal review, the user does not have a compiler. The user has to read the judgment to find out the redline is wrong, and at that point the user has done the work they were trying to delegate.

There is a tell in Claude’s own retraction, and it is the single most important sentence in the entire transcript: “I was speculating from search snippets instead of reading the judgment.” That sentence is the structural diagnosis. Speculating from search snippets is a perfectly adequate strategy in many domains. Pure brainstorming. Initial-draft prose. Light reformatting tasks. Most of what knowledge workers do in a day looks like that. In a domain where the cost of being wrong is borne in front of a judge or in front of opposing counsel or in front of a regulator, speculation from snippets is a malpractice trajectory. The harness has to refuse to author claims it has not grounded. Claude Cowork’s harness does not refuse. It produces.

The Hadley v Baxendale analogy in the transcript is the most instructive artefact. Claude did not just produce a wrong critique. It produced a critique that scaffolded its wrongness with confident legal reasoning. The analogy itself was a flourish, an unprompted comparison between an asset-division case and a contract-remoteness case, deployed to add gravitas to the critique. It was beautiful. It was also wrong. The output became more persuasive the more wrong it was, because the rhetorical machinery is unrelated to the underlying accuracy. A junior associate or a non-specialist GC reading that paragraph would have no defence against it.

This is the worst possible failure mode for legal AI. Not hallucination in the simple sense, where the model invents a citation that does not exist. Something more dangerous: confident, well-reasoned-sounding wrongness with the trappings of authority. It is the mode that destroys trust in a domain that runs on trust.

The confidence calibration in the transcript is also worth noting. Claude’s redline used four severity codes. Three of the items flagged in red, the most serious category, were wrong. The items flagged in yellow and orange, lower-severity citation hygiene and minor imprecision, were largely right. The harness produced an output in which the assertiveness of the criticism was inversely correlated with the accuracy of the criticism. A user with no Singapore family-law expertise would have weighted the red items most heavily, which is to say they would have weighted the wrong items most heavily. The harness’s confidence calibration was not just imperfect. It was inverted.

The fix for all of this is not a better model. Claude Opus 4.7 is already, on its second pass, capable of pulling paragraph 45 verbatim and identifying the 62:38 ratio in ANJ v ANK. The fix is a harness that forces the second pass before the first response. A verifier agent that pulls primary text before any critique is authored. Confidence calibration that scales with verification depth. A workflow that does not let the system author a paragraph it has not grounded in source.
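The verify-before-assert loop just described can be sketched in a few lines. Everything in this sketch is illustrative: the function names, the in-memory corpus, and the `Claim` type are assumptions made for the example, not LITT's or Anthropic's actual interfaces.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    citation: str    # e.g. "APE v APF [2015] SGHC 17"
    assertion: str   # the proposition the critique wants to make
    grounded: bool   # True only once the primary text has been read

def fetch_primary_text(citation: str, corpus: dict) -> Optional[str]:
    """Stand-in for a primary-source fetch (eLitigation, judiciary.gov.sg, etc.)."""
    return corpus.get(citation)

def author_claim(citation: str, assertion: str, corpus: dict) -> Optional[Claim]:
    """Author a claim only after the cited text is in hand. If the source
    cannot be pulled, return None ("I cannot answer this without pulling
    the case") rather than an ungrounded critique."""
    text = fetch_primary_text(citation, corpus)
    if text is None:
        return None  # refuse to assert; never speculate from snippets
    return Claim(citation, assertion, grounded=True)

corpus = {"APE v APF [2015] SGHC 17": "...bird's nest, snow jelly and cordyceps..."}
ok = author_claim("APE v APF [2015] SGHC 17", "the delicacies passage is verbatim", corpus)
missing = author_claim("Some unfetched case", "holding misstated", corpus)
assert ok is not None and ok.grounded
assert missing is None  # the harness declines instead of guessing
```

The point is the shape of the control flow, not the toy retrieval: the assertion step is unreachable without the retrieval step succeeding first, which is the inversion of the produce-first default described above.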

That is what a legal harness is, and it is what Cowork and Code are not, by design, because they are not supposed to be.

3. The Shape of Legal Work

Underneath the harness argument is a deeper one about the shape of legal work itself.

Software engineering, simplifying somewhat, is search under a correctness constraint. You are looking for an output, a function, a system, that does what the specification says. There are many wrong answers and a relatively narrow band of right answers, and at the boundary, the compiler or the test suite tells you which side you are on. Iteration converges. Mistakes are recoverable. The feedback loop is short and definitive. This is why “try things and see what works” is a productive software-engineering posture, and it is why Claude Code is built around it.

Legal work, at any meaningful level of complexity, is not search under correctness. It is construction under interpretive constraints. A statute is a string of words. The string of words has many internally consistent readings. A controlling case has a holding, and the holding has many internally consistent applications to a new fact pattern. Across jurisdictions, across practice areas, across regulatory contexts, the corpus does not yield a single answer. It yields a defensible position, and the practitioner’s craft is in selecting and constructing the position that serves the client, that survives opposing counsel, that gets affirmed on appeal, that does not collapse under cross-examination.

This is the infinite-interpretations-under-constraints problem. Any sufficiently complex legal question has thousands of valid readings of the controlling material. The lawyer’s work is not to find the answer. The lawyer’s work is to choose, from a large space of internally consistent positions, the one that best advances the client’s interest within the constraints imposed by the law, the facts, the jurisdiction, the forum, and the opposing position. The lawyer then constructs that position by deliberately stitching together a chain of authority: the statute, the controlling case, the distinguishing case, the persuasive secondary source, the practice note, the policy guidance, the analogous foreign decision. The stitching is the work. The stitching is what the harness has to be built for.

A coding harness is not built for this shape of work. It cannot be, because its feedback loop assumes a passing-test signal, and construction-under-constraints work provides none. When you ask a coding-shaped system to do construction-under-constraints work, what you get is an output that asserts a position with the same confidence the model would use to assert a correct answer, because the harness does not distinguish between the two. There is no internal signal that says “this is a defensible position, not the position, and the choice between defensible positions is a strategic act that requires more context than I have.”

Consider a concrete example. A commercial litigator is preparing to bring a fraud claim and needs a transaction-flow map built from a bare-text statement of claim that describes thirty-plus payments across six entities. Inside Cowork, this is a workflow that requires the user to do a substantial amount of manual orchestration. They have to prompt the model to read the document, prompt it again to identify each entity, prompt it again to identify each transaction, prompt it again to resolve the ambiguities where the same entity is referred to differently in different paragraphs, prompt it again to build a structural representation, prompt it again to render that representation visually, and at every step verify the output against the source document because the harness does not enforce that verification. The lawyer is doing engineering work that they did not sign up for, and they are doing it on top of the legal analysis they actually needed to do. Hours of to-and-fro, with the constant background risk that one of the model’s intermediate steps misread a paragraph and that misreading has propagated through every downstream step.

A legal-specific harness handles the same task differently. It does not start with a pre-baked ontology for fraud-claim flows that the user has to fit their facts into. It constructs the ontology specific to the task at hand on the fly: the entities present in this claim, the transactions described, the relationships asserted, the dates and amounts and currencies, the way each entity is referenced across paragraphs. The construction itself is an adaptive workflow that the harness runs end to end, with each step verifiable against the source document by the harness and the lawyer together. The output is a typed structure the lawyer can read, edit, and cite from, not a prose generation that has to be regenerated when one detail changes.
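A typed structure of the kind just described, applied to the transaction-flow example, might look like the following minimal sketch. The class and field names are invented for illustration and do not describe LITT's actual data model; the point is that the output is editable, citable data rather than regenerated prose.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass(frozen=True)
class Entity:
    name: str
    aliases: Tuple[str, ...]  # how the same party is referenced across paragraphs

@dataclass
class Transaction:
    payer: str
    payee: str
    amount: float
    currency: str
    source_para: int          # every entry stays verifiable against the claim

@dataclass
class FlowMap:
    entities: List[Entity] = field(default_factory=list)
    transactions: List[Transaction] = field(default_factory=list)

    def total_to(self, payee: str) -> float:
        return sum(t.amount for t in self.transactions if t.payee == payee)

flow = FlowMap(
    entities=[Entity("Alpha Holdings Pte Ltd", ("Alpha", "the First Defendant"))],
    transactions=[
        Transaction("Alpha", "Beta LLC", 250_000.0, "SGD", source_para=12),
        Transaction("Gamma Ltd", "Beta LLC", 90_000.0, "SGD", source_para=19),
    ],
)
assert flow.total_to("Beta LLC") == 340_000.0
# Correcting one detail updates the map; nothing is regenerated from scratch.
flow.transactions[0].amount = 275_000.0
assert flow.total_to("Beta LLC") == 365_000.0
```

Because each transaction carries its source paragraph, the lawyer and the harness can check any one entry against the statement of claim without re-auditing the whole map.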

This is not a matter of LITT being smarter than Anthropic. It is a matter of having spent the engineering time to build legal-specific adaptive workflows, where the harness understands the shape of the task in front of it and constructs the structure needed to do it, rather than asking the user to construct that structure through a sequence of prompts to a general-purpose system. That engineering work is what a harness is.

It is also why the data-corpus argument, the idea that whoever has the most legal data wins, is wrong in the way it is usually framed. The corpus is necessary. No legal AI works without a comprehensive legal corpus. But once you have it, the harness either knows how to construct positions from it or it does not. A general-purpose harness pointed at the world’s best legal corpus will still apply general-purpose reasoning to that corpus, and general-purpose reasoning over a legal corpus produces something that looks like legal writing but is not legal work. The Claude transcript above is an example of this exact mistake at a small scale. Claude had access to enough of the source material to know what ANJ v ANK was about. It did not have, in its default workflow, the lawyer’s reflex of saying “wait, ANJ v ANK does several things, let me pull the actual judgment to see what is in it before I assert it cannot be cited for proposition X.” That reflex is harness-level, not model-level. A legal-specific harness builds it in. A coding-shaped harness does not.

4. The Verification Paradox

This brings us to the most operationally important argument in the entire piece, and the one any working lawyer will recognise immediately from their own practice.

In software, verification of an AI’s output is cheap. The model writes code, you run it, you see what happens. The cost of catching a mistake is seconds. Because the cost is seconds, the rational default for an AI assistant is to produce, observe, correct. This is the cycle Claude Code is built around. It is productive precisely because the verification step is so much cheaper than the production step that wasted production is acceptable.

In legal work, verification of an AI’s output is often more expensive than redoing the work from scratch. To verify a single asserted holding, the reviewer reads the cited case. To verify a section reference, the reviewer pulls the statute and reads the surrounding provisions. To verify a worked example, the reviewer reconstructs the underlying calculation. To verify a chain of authority, the reviewer walks the chain.

The transcript discussed above is, by itself, a demonstration of this dynamic. Claude generated a structured redline in roughly thirty seconds. To verify that redline, the user had to pull three judgments and read enough of each to identify the relevant paragraphs. The verification took meaningfully longer than the generation. And the verification produced the finding that the generation was wrong on the points that mattered most.

When verification cost exceeds production cost, the produce-then-verify model is not just slow. It is a trap. The AI assistant becomes a generator of work-to-be-redone rather than a generator of work-to-be-used. The user is no longer leveraging the AI to save time. The user is using the AI to spend time auditing the AI’s output, and the time spent on auditing is often greater than the time the user would have spent doing the work themselves. This is the moment that ends most legal-AI pilots inside law firms. The associate runs the tool, gets a draft, audits the draft, decides the audit took longer than writing the original would have, and stops using the tool.

A legal-specific harness has to be built on the opposite assumption from a coding-specific harness. It has to assume that verification is expensive, that authoring without verification is malpractice, and that the system’s first job is not to be fast but to be grounded. The verifier agent has to run before the user sees the output, not after. The system has to be willing to say “I cannot answer this without pulling the case” rather than producing an authoritative-sounding answer with the case unread.

This is what “the priors are wrong” means in practice. A coding-shaped harness has priors that suit a workflow where verification is cheap. A legal-shaped harness has to have priors that suit a workflow where verification is expensive. You cannot get from one to the other by changing the system prompt. You cannot get from one to the other by adding tool connectors. You get from one to the other by rebuilding the loop.

5. A Practitioner’s Comparison

The rest of this piece is the comparison itself. What follows is organised around the seven dimensions that, on this analysis, matter most for legal teams choosing between Claude Cowork and LITT for serious legal work. These are not the only differences between the two systems, but they are the ones that change which tool fits which task.

5.1. Built for the work, not adapted to it

Claude Cowork is a horizontal product with vertical configuration layered on top. The twelve practice-area plugins are configuration, not architecture. Each one is a system-prompt-shaped role, a setup interview that learns the user’s playbook, and a curated set of MCP connectors. They sit on Cowork. They inherit its defaults.

LITT is the inverse. The reasoning architecture is built around the way legal work actually moves: position construction over multi-source materials, verifier-grounded outputs, jurisdiction-aware retrieval, case-text-first reasoning. The general-purpose model layer sits inside the legal-specific one, not the other way around. When a partner asks LITT to draft a memo, the harness’s first move is not to call a general model with a legal system prompt. It is to identify the legal task type, route the right combination of retrieval, extraction, drafting, and verification subagents at the task, and assemble the output through a workflow shaped for legal work rather than for general knowledge work.

The deeper consequence of this is felt over time. A horizontal harness with a configuration layer ages by being reconfigured. Every new playbook is a new setup interview, every new practice area is a new system prompt, every new jurisdiction is a new tool connector. The user does the integrating work. A vertical harness ages by accumulating capability inside itself. Every matter that runs through it sharpens the retrieval, hardens the verifier, refines the workflow. The system gets better at the work because the work is what it is built to do.

5.2. Multi-model routing, not single-vendor dependence

Anyone who has run Claude hard against a large project knows about its limits. The Pro plan caps usage within rolling five-hour windows. The Max plans add weekly caps. Even Enterprise users find themselves at quota boundaries on long-document work because every sub-task hits the same frontier model.

This is a structural consequence of single-vendor architecture. When Claude Cowork orchestrates a task, every step inside the orchestration runs on a Claude model. Clause classification on a large M&A diligence runs on a frontier-priced model. Metadata extraction from a deposition transcript runs on a frontier-priced model. Citation parsing runs on a frontier-priced model. Each of these sub-tasks could be served by a much smaller, much cheaper model, but a single-vendor harness cannot route to one.

The deeper version of this argument is not about cost. It is about accuracy. Vals AI’s LegalBench leaderboard, updated 7 May 2026, lists the top three legal-reasoning models as Gemini 3.1 Pro Preview at 87.4 percent, Gemini 3 Pro at 87.0 percent, and Gemini 3 Flash at 86.9 percent. Claude Opus 4.7 leads the general Vals Index (72.2 percent) and several other benchmarks, but on legal reasoning specifically it sits outside the top three. The implication is direct. A user whose legal AI is structurally tied to Anthropic’s model family is, on the most widely watched independent legal benchmark, not using the strongest model available for the task. They are using the strongest model their vendor offers. Those are different statements.

LITT is multi-model by design. The harness routes each sub-task to the model that is most efficient and most accurate for that task. Small open-weight models for clause classification and metadata extraction. Mid-tier proprietary models for first-pass drafting. Claude where Claude is strongest. Gemini where Gemini is strongest. Deterministic code for parsing, citation resolution, deadline math, and other tasks that should not be in a model at all because they are deterministic problems that deserve deterministic solutions.
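Deadline math is a good example of a task that belongs in deterministic code rather than a model. The sketch below is a toy: real limitation and filing rules differ by jurisdiction and court, and the weekend-roll rule here is an assumption made purely for illustration, not any court's actual rule.

```python
from datetime import date, timedelta

def filing_deadline(served_on: date, days: int) -> date:
    """Count calendar days from service, then roll forward past a weekend.
    (Illustrative only: real court rules vary by jurisdiction and often
    involve public holidays, clear days, and court-specific practice.)"""
    deadline = served_on + timedelta(days=days)
    while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        deadline += timedelta(days=1)
    return deadline

# Served Friday 6 March 2026: a 15-day period lands on a Saturday,
# so the deadline rolls to Monday 23 March 2026.
assert filing_deadline(date(2026, 3, 6), 15) == date(2026, 3, 23)
assert filing_deadline(date(2026, 3, 6), 14) == date(2026, 3, 20)
```

A language model asked the same question produces a plausible date; deterministic code produces the same correct date every time, which is the property the argument above is pointing at.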

The downstream effect on production work is significant. Because the heavy frontier-tier calls are reserved for the steps that genuinely need them, the user hits ceilings far less often. And because the model used for each sub-task is chosen on a per-task basis against current benchmark performance, the output is not hostage to which vendor is leading this quarter. As the legal-reasoning leaderboard changes, the harness changes which model it routes legal-reasoning calls to. The user experience does not change. The accuracy keeps up.
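At its core, per-sub-task routing of this kind reduces to a dispatch table that can be updated without touching the caller. The task labels, model names, and fallback below are placeholders invented for the sketch, not real benchmark results or LITT's actual routing logic.

```python
# Hypothetical routing table. Model tiers are placeholders.
ROUTES = {
    "clause_classification": "small-open-weight",
    "metadata_extraction": "small-open-weight",
    "first_pass_drafting": "mid-tier",
    "legal_reasoning": "frontier-A",     # whichever model currently leads the leaderboard
    "citation_parsing": "deterministic", # not a model call at all
}

def route(task_type: str) -> str:
    """Pick the backend for a sub-task. The caller's interface never
    changes even when the table behind it does."""
    return ROUTES.get(task_type, "frontier-A")

assert route("clause_classification") == "small-open-weight"
# When the legal-reasoning leaderboard changes, only the table changes:
ROUTES["legal_reasoning"] = "frontier-B"
assert route("legal_reasoning") == "frontier-B"
```

The user-facing behaviour is stable while the routing beneath it tracks whichever model is currently strongest for each sub-task, which is the point made in the paragraph above.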

5.3. Reasoning architecture, plus a continuously updated corpus

A common framing in legal-AI buying conversations is “who has the most data.” It is not the wrong question, but it is not the decisive one either.

The Claude release on May 12 includes connectors to Legal Data Hunter (thirty-one-million-plus documents from one-hundred-sixty-plus jurisdictions), Midpage, Trellis (the largest state trial-court dataset in the US), Free Law Project (CourtListener and PACER), Thomson Reuters CoCounsel grounded in Westlaw and Practical Law, Harvey, Solve Intelligence for patent corpora, and many more. The data is, by any measure, abundant. Anyone who needs legal data can now access it through a single harness.

What is not solved by data abundance is reasoning architecture. The Claude transcript discussed earlier had access to summary snippets, search hits, and the ability to fetch primary case text. The data was available. The failure point was the harness’s default behaviour of speculating from snippets. A different harness with the same data access would have produced different output, because the difference was in how the harness used the data, not in what data it had.

LITT’s reasoning architecture is built around the structure of legal documents and the structure of legal reasoning. Definitions resolve. Cross-references map. Citations get checked against source. Jurisdictional hierarchy is respected. Treatment status (overruled, distinguished, followed) is tracked. None of this is in the model. All of it is in the harness.

What LITT adds underneath the reasoning architecture is a continuously updated, interconnected corpus where semantic relationships across statutes, cases, regulations, and commentary are defined within the interpretive constraints of each jurisdiction and practice area. A retrieval call inside LITT does not return a relevance-ranked passage from a generic legal database. It returns a structured object positioned inside a graph: this case is a holding, this is dicta, this is a procedural ruling, this section was amended in 2018, this judgment was distinguished in 2021, this principle applies in Singapore but not in Malaysia, this regulator’s guidance modifies the statutory text in this specific way. The corpus is the substrate; the semantic structure on top of it is what makes downstream reasoning tractable.
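The difference between a relevance-ranked passage and a graph-positioned object can be made concrete with a sketch. The field names and the `citable` gate below are assumptions for illustration, not LITT's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch of a graph-positioned retrieval result, as opposed
# to a flat relevance-ranked text passage. Field names are assumptions.
@dataclass
class Authority:
    citation: str
    jurisdiction: str
    proposition_kind: str              # "holding" | "dicta" | "procedural"
    treatment: Optional[str] = None    # "followed" | "distinguished" | "overruled"
    amended_in: Optional[int] = None   # year of amendment, for statutes
    applies_in: list = field(default_factory=list)  # jurisdictions where it holds

def citable(a: Authority) -> bool:
    """A minimal gate: never surface an overruled authority as support."""
    return a.treatment != "overruled"
```

A downstream reasoning step that receives an `Authority` rather than a paragraph of text can check treatment status and jurisdictional scope in code, which is what makes the reasoning tractable.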

The corpus question is solvable in the medium term by money and partnerships. Plenty of vendors will sell access to plenty of legal data. The reasoning-architecture-plus-interconnected-corpus question is not solvable by money alone. It is solved by years of legal-specific engineering, by feedback loops with practising lawyers, and by a willingness to do the unglamorous work of encoding domain structure into code rather than hoping the model figures it out on every call. This is the work that horizontal labs do not do, because doing it at depth for one vertical means not doing it for the eleven other verticals that also need it.

5.4. Hallucination control via the verifier agent

The single most important component in LITT’s harness, and the component that maps most directly to the transcript discussed earlier, is the verifier agent.

Every cited output produced inside LITT is passed through a verification step before it reaches the user. The verifier pulls the cited source, locates the asserted proposition, compares the model’s gloss to the source text, flags pinpoint mismatches, and refuses to ship outputs that fail verification. It does this at the cost of additional model calls and additional latency. It is one of the reasons the multi-model routing matters: to make this extra verification affordable, the upstream steps have to be served by the right model at the right price.
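The control flow of that verification step can be sketched as follows. `fetch_source` and `compare_gloss` are stand-ins: in a real harness the first pulls primary text and the second involves a model call plus exact-span checks. Here both are stubbed so the sketch executes, and the refusal-to-ship behaviour is the part being illustrated.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    citation: str
    supported: bool
    reason: str

def fetch_source(citation: str) -> str:
    # Stub: a real harness pulls the primary text of the cited judgment.
    return "The court held that contributions must be assessed holistically."

def compare_gloss(gloss: str, source_text: str) -> bool:
    # Stub: a real verifier compares the asserted proposition against the
    # source text. A crude word-overlap test keeps this sketch runnable.
    return any(w in source_text.lower() for w in gloss.lower().split())

def verify(citation: str, gloss: str) -> Verdict:
    source = fetch_source(citation)
    if not source:
        return Verdict(citation, False, "source not found")
    if not compare_gloss(gloss, source):
        return Verdict(citation, False, "gloss not supported by source text")
    return Verdict(citation, True, "supported")

def ship(drafted_claims: list[tuple[str, str]]) -> list[Verdict]:
    """Refuse to ship any output containing a claim that fails verification."""
    verdicts = [verify(c, g) for c, g in drafted_claims]
    if not all(v.supported for v in verdicts):
        raise ValueError("output blocked: unverified citation(s)")
    return verdicts
```

The structural point is in `ship`: failure is a hard stop, not a warning the model can talk past.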

The contrast with Claude Cowork’s behaviour in the transcript is immediate. Cowork generated three substantive criticisms, dressed in legal-analogy scaffolding, before reading the underlying judgments. Eve, the litigation platform built on Claude and quoted in Anthropic’s own May 12 release, puts it plainly: “in litigation, an authoritative-sounding hallucination is worse than no answer.” That is exactly correct, and the way you prevent authoritative-sounding hallucination is not by hoping the model is well-behaved. It is by building a verifier into the harness that the model cannot route around.

Claude’s plugins acknowledge the importance of grounding through tool connections to primary law sources. What they do not do is enforce verification as a structural step that every output must pass through. The user of the plugin has to know to ask. LITT’s user does not have to know to ask, because verification is not an optional feature. It is part of how the harness produces output at all.

There is a subtle point here. Verification is not just about catching hallucinated citations, the kind of error where the model invents a case that does not exist. Those errors are real but they are the easy ones to defend against, because the case either exists or it does not, and a simple lookup catches it. The harder error, the one Claude produced in the transcript, is the case-that-exists-but-is-mis-glossed: the citation is real, the proposition attributed to it is wrong. Catching this requires reading the case and comparing it to the gloss. Generic grounding does not catch it. A purpose-built verifier does.

5.5. Extraction tuned to the structure of legal documents

Legal documents are variably structured. In jurisdictions like the US and the UK, contracts, judgments, and regulatory filings tend to follow more consistent formatting conventions. In India and several other South Asian jurisdictions, the corpus is highly unstructured, with case texts and statutes appearing as free-form prose without consistent headings, citation formats, or section markers. A legal AI system that assumes structured input fails the moment it touches an Indian Supreme Court judgment.

What is true across jurisdictions is that legal documents have implicit structure even when they look like prose. A contract has parties, definitions, recitals, operative clauses, schedules, and signature blocks. A judgment has parties, procedural history, facts, holding, ratio, and obiter. A statute has sections, subsections, definitions, transitional provisions, and schedules. The structure is there. Whether it is explicit in the document, or has to be recovered from prose, varies by jurisdiction.

Generic retrieval over legal documents treats them as text. Domain-specific extraction and data-processing systems treat them as structured objects. LITT runs an extraction and data-processing system tuned to legal document structure across the jurisdictions it serves. The result is that when a search runs, what is returned is not just a relevance-ranked text passage. It is a structured object the harness can reason about: this is a holding, this is dicta, this is a procedural ruling, this clause is a representation and this one is a warranty, this section was amended in 2018, this judgment was distinguished in 2021.

The implication for unstructured corpora is particularly sharp. In jurisdictions where the legal corpus is highly unstructured (Indian case law being the canonical example), the value of purpose-built extraction is higher, not lower, because more of the structure has to be recovered from prose. A general-purpose harness that assumes documents will arrive pre-structured will silently degrade in quality the further it moves from US-style and UK-style materials. A vertical harness built to recover structure from unstructured prose performs consistently across jurisdictions.
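Recovering structure from unstructured prose can be sketched with cue-phrase heuristics. The cue list below is illustrative only; real extraction would be far richer, and the labels mirror the judgment anatomy described above.

```python
import re

# Sketch of recovering implicit judgment structure from free-form prose,
# using simple cue-phrase heuristics. The cue list is an illustrative
# assumption, not a production extraction system.
SECTION_CUES = {
    "facts":      re.compile(r"\b(the facts|background)\b", re.I),
    "procedural": re.compile(r"\b(procedural history|proceedings below)\b", re.I),
    "holding":    re.compile(r"\b(we hold|it is held|the court holds)\b", re.I),
    "obiter":     re.compile(r"\b(in passing|obiter)\b", re.I),
}

def tag_paragraphs(judgment_text: str) -> list[tuple[str, str]]:
    """Label each paragraph with the section it most likely belongs to,
    carrying the last matched label forward until a new cue appears."""
    tagged, current = [], "unclassified"
    for para in judgment_text.split("\n\n"):
        for label, cue in SECTION_CUES.items():
            if cue.search(para):
                current = label
                break
        tagged.append((current, para))
    return tagged
```

Even this toy version shows why purpose-built extraction matters more, not less, for unstructured corpora: the structure has to be recovered before any downstream step can rely on it.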

The cost implication is significant. A reasoning step over structured objects costs a fraction of what a reasoning step over raw text costs, because the model is not re-discovering structure on every call. The accuracy implication is even larger. Citation accuracy, definition resolution, cross-reference handling, treatment-status tracking, all of these get materially better when the harness knows the structure rather than having to infer it.

5.6. An interface built on legal-native primitives

This is the dimension where the distinction has to be drawn carefully, because Claude Cowork, particularly through its Microsoft Word integration, does redlining well. The cross-app context between Word, Outlook, Excel, and PowerPoint announced on May 12, 2026 is a genuine productivity improvement, and a lawyer drafting a single contract in Word with Claude in the side panel will get useful redlines, suggested fallback language, and clause-level commentary. That should not be understated.

The question is what happens above the single-document level. A matter is not a document. A matter is a set of documents, exhibits, communications, prior drafts, citations, deadlines, parties, and counterparty history that all relate to the same legal problem and have to be navigable as a whole. Claude Cowork’s primary interaction model is chat, with documents and tool calls attached. It is brilliant for builders and good for single-document workflows. It is less well-suited to the matter-state-as-first-class-object problem that defines most serious legal work.

LITT’s interface is built around legal-native primitives. Matters are persistent workspaces. Chronologies are structured visualisations with citation links back to source. Exhibits are first-class objects inside a matter view. Citation chains are navigable. Playbook deviations are surfaced clause by clause with the relevant fallback language attached. The verifier’s state is visible to the user, so they can see what has been checked against source and what has not. None of this requires the lawyer to think about prompts, models, tools, or any of the engineering primitives that Cowork exposes.

This matters most for the partner-level user who has zero patience for prompt engineering and is the ultimate buyer in most legal-AI deals. The associate may be willing to learn the prompts. The partner is not. A tool that requires partner-level users to write good prompts will get adopted by the associates and resented by the partners who pay for it. A tool that meets the partner where they work, on matters rather than on chat threads, gets adopted at the top.

There is also a cognitive-load dimension worth naming. A long-document or multi-matter workflow is significantly easier to manage when the user can see what state the system is in: which documents have been processed, which clauses have been flagged, which citations have been verified, what is outstanding. Chat does not show this. A workflow UI does. The lawyer can hold the case in their head because the system is holding the rest in the interface.

5.7. Prompt construction inside the harness

Anthropic publishes prompting guides. Their own documentation includes pages of advice on how to write better prompts: be specific, use examples, structure your request, include the right context. Skilled Claude users know these techniques. There is an entire cottage industry around the discipline of prompt engineering.

The existence of this discipline is a structural admission. The user is responsible for getting the prompt right. The model’s quality is conditioned on the user’s prompting skill. When the lawyer asks Claude Cowork a question, the answer they get is a function of how they phrased the question, what context they included, what playbook they uploaded, what format they specified, and whether they remembered to ask for verification. The skill of the lawyer at prompting becomes the limit of the system’s usefulness.

LITT’s position is that this is the wrong place to put that responsibility. The user states intent. The harness handles construction. A partner asking for a memo on a particular question does not need to know how to phrase the request to get a good output. The harness decomposes the task, decides which retrieval and reasoning steps are needed, selects the right models for each, runs verification, and returns the output. Prompt construction, prompt optimisation, model selection, and verification routing are engineering work, and engineering work belongs inside the system, once, rather than being externalised to every lawyer in every session.
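That division of labour, user states intent, harness handles construction, can be sketched as a small orchestration loop. Every helper name here is a hypothetical stand-in: `decompose` would really plan retrieval and reasoning steps from the stated intent, and `run_step` would route each step to its own model or to deterministic code.

```python
# A minimal orchestration sketch of "user states intent, harness handles
# construction". All helper names are hypothetical stand-ins.

def decompose(intent: str) -> list[str]:
    # Stub: a real harness plans the pipeline from the stated intent.
    return ["retrieve_authorities", "draft_memo", "verify_citations"]

def run_step(step: str, context: dict) -> dict:
    # Stub: each step would route to the right model or deterministic code.
    context[step] = f"done:{step}"
    return context

def answer(intent: str) -> dict:
    """The user supplies intent; prompt construction, model selection,
    and verification routing all happen inside this function, once."""
    context: dict = {"intent": intent}
    for step in decompose(intent):
        context = run_step(step, context)
    return context
```

The partner's request is the only input; everything the prompting guides teach is internal to `answer`, which is the structural claim the paragraph above is making.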

This connects to the UX argument. A lawyer who is taught to write good prompts is a lawyer being asked to do engineering. A lawyer who is presented with a workflow interface is a lawyer being asked to do lawyering. The first scales badly because the variance in prompting skill across a firm is high and unmanageable. The second scales well because the harness is the constant. The partner who has never written a prompt in their life gets the same quality output as the associate who has read every prompting guide Anthropic publishes, because the prompting is not the user’s job.

There is a longer-term argument here that is underappreciated. Prompt engineering as a discipline exists because horizontal products externalise the work of getting good output to the user. As vertical harnesses mature, that work moves inside the system. The lawyer of 2030 will not be writing prompts. The lawyer of 2030 will be using legal-AI systems that have absorbed prompt construction as an internal engineering concern, the way modern software developers do not write SQL queries when they use a high-level ORM, even though SQL is what is actually running underneath. The vertical harness is the ORM. The general-purpose model is the database. You can use the database directly if you have the skills, but most lawyers will not, and the harness is what makes that not matter.

6. When to Use Claude

Now the honest split.

Claude Cowork is the right tool for a meaningful set of legal tasks. For quick research questions where the user will read the output critically and the stakes of being wrong are low, Claude is fast and competent. For brainstorming, for early-stage drafting where the user is going to substantially rewrite anyway, for summarising a single document for personal consumption, for explaining a concept, for translating a passage, for generating a first draft of an internal email or a presentation deck, Claude is a strong general-purpose assistant.

Claude is also a reasonable choice for non-sensitive matter intake, low-volume document classification, basic clause extraction where the user is going to spot-check the output, single-document redlining in Word where the cross-app integration earns its keep, and any task where the data is not privileged or sensitive and the user does not need an audit trail.

For firms using Claude through the Microsoft integration, in Word and Outlook, for general-purpose drafting and email triage, the May 12 release is a real productivity improvement for general knowledge work. Most lawyer days contain hours of general knowledge work, and Claude does that work well. The five-hour Pro cap on a busy day will be felt, but for most users in most weeks, the volume is manageable.

The heavy lifting in legal practice, the work where errors propagate silently, where verification is expensive, where citations must be ground-truth correct, where construction over multi-source material is the actual task, where the cost of being wrong is measured in client losses or in malpractice exposure, is a different kind of work. Not because Claude is bad. Because the harness is not built for it.

The right way to think about this is the way a partner thinks about delegating to a junior associate versus a paralegal versus a senior associate. Each is the right resource for a different type of work. The mistake is not in choosing any one of them. The mistake is in pointing the wrong one at the wrong task. Claude Cowork is a strong general-purpose knowledge worker. LITT is a specialist. Use them for what they are for.

For tasks where the specialist is the right resource:

Long-document diligence on M and A transactions, where multiple-pass extraction and cross-document consistency checking matter, where the cost of missing a representation is real, and where the audit trail has to survive a post-closing dispute.

Litigation research where every cited authority needs to be verified against primary text, where the chronology runs across hundreds of exhibits, and where opposing counsel will rip apart any unsupported claim. Transaction-flow mapping from a bare-text statement of claim is the canonical example.

Regulatory analysis across multiple jurisdictions, where the lawyer has to construct a position rather than retrieve an answer, where statutory citations have to be exact, where the harness has to know the difference between primary law and guidance, and where the audit trail has to survive a regulator’s review.

Contract review against an internal playbook, where the cost of missing a deviation is measured in years of contractual exposure and where the user needs to see, on every clause, what the playbook says, what the counterparty has proposed, what the deviation is, and what the fallback language is.

Privacy and compliance work where statutory citations have to be exact, where the harness has to track which rule applies to which jurisdiction, where DSAR deadlines and DPA obligations have statutory clocks that must be hit, and where the work product gets read by a regulator who will check it.

Privileged matter work where the data cannot leave the firm’s controlled environment, where the audit trail has to be complete, and where the cost of a confidentiality leak is existential.

For these tasks, the specialist is the right resource. For everything else, the general-purpose tool is fine. Use both. Use the right one for the right task.

7. The Layer That Matters

The most underread sentence in Anthropic’s May 12 announcement is the one in which Anthropic acknowledges that Thomson Reuters has rebuilt CoCounsel on the Claude Agent SDK, that Harvey and Solve Intelligence and others are doing the same, and that the integration “runs both ways.” The sentence is small. The admission inside it is large.

Anthropic is one of the strongest frontier-model labs in the world. They could, if they chose, build any vertical harness they wanted. In legal, they shipped a marketplace of plugins and an SDK, and they partnered with the vertical harness builders rather than competing with them. That is not a failure of ambition. It is a deliberate split. The intelligence is theirs. The harness, for any vertical that matters, is the partner’s. Legora’s CTO says it as plainly as anyone in the announcement: “Anthropic builds the underlying intelligence; Legora turns it into production-ready systems.”

What is true in legal is also true in finance, in healthcare, in life sciences, in any domain where the consequences of being wrong are non-recoverable and where the work has a shape that general-purpose harnesses do not match. Horizontal products provide the engine. Vertical harnesses provide the car. The car is the product the user actually drives. The engine matters, and a vertical harness will happily use whichever engine is strongest for the task, including Claude where Claude leads. But the car is what the user buys.

“The future of legal AI is not picking the smartest model and pointing it at a document. It is the amalgamation of UX, software, and AI working together as a structured, efficient system. Intelligence that guides the work while running with the discipline of code. That is what reduces cost and time for the standardised work that consumes most of a lawyer’s day, and it is the only way to build trust at the point where the work leaves the screen and reaches a client.”
Sushant Shukla, CEO of LITT, to Legal Wires

The choice for legal teams is not Claude or LITT. It is the right harness for the right task. For most quick, low-stakes, general-purpose work, Claude Cowork is good. For the heavy work, the work that ends up in front of judges and counterparties and regulators, a harness built for legal work, drawing on Claude where Claude is strongest and on other models where they lead the relevant benchmark, will produce a different kind of output.

Pick the harness that matches the shape of the work. That is the only question that matters in this category, and on the evidence available today, including the Vals AI legal leaderboard, including the transcript above, and including Anthropic’s own decisions about where to ship plugins and where to partner instead, the answer depends on the task, not on the model.

Written by Sushant Shukla