Context Before Autonomy
A position on coding agents, repository reality, and the missing layer
Coding agents arrived in our workflows before our repositories were ready to receive them. That is not a criticism of the teams who built those repositories. It is a description of how the adoption cycle actually unfolded.
The conversation around coding agents has been almost entirely about what agents can do: how capable they are, what tools they can access, how well they score on benchmarks, how quickly they complete a task. Almost nothing in that conversation has addressed the other side of the interaction. What does a repository need to communicate about itself before an agent starts changing code? That gap matters now, in hybrid workflows where a developer is present to catch a misframing before it lands. It will matter considerably more as delivery models move toward increasingly autonomous operation, where the distance between what an agent knew before it started and what the repository actually requires shows up at the pull request stage rather than during the task.
This is an attempt to name that gap, explain where it comes from, and make the case for a minimal, practical response.
A note on language: this paper uses terms like “understand,” “know,” “read,” and “interpret” to describe agent behavior. These are not claims about subjective experience or intent in any human sense. They describe functional outputs and processing behaviors, using the vocabulary available in human language because no cleaner alternative exists for discussing these dynamics. The mechanisms involved are fundamentally different from human cognition; the language is used to make the ideas accessible, not to claim equivalence.
What the Discourse Has Missed
The dominant frame in agentic engineering evaluates agents on their outputs: does the code compile, does it pass tests, does it match the intent of the request? That frame is useful and necessary. It is also incomplete.
The missing question is not “can the agent produce correct code?” It is “does the agent have enough context about this specific repository to know what correct looks like here, for this task, and how that context will change what the result actually needs to be?”
Those are different questions. A model that produces clean, idiomatic, well-tested code against a greenfield specification may produce operationally risky code against a long-lived product codebase with a history of defensive practice and carefully preserved architectural constraints. The difference is not model capability. The difference is context sufficiency.
Context sufficiency is beginning to receive attention, but not yet where it most needs to. Research teams at Amazon, ByteDance, and Google have made real progress on the technical problem of getting repository context into a model’s context window: retrieval-augmented generation, selective context injection, cross-file snippet extraction (Shrivastava et al., 2024; Liu et al., 2024; Google Cloud AI, 2024). That work is valuable. It addresses how to deliver context to an agent.
The tooling layer has moved in parallel. The introduction of standardized artifacts like AGENTS.md, .github/copilot-instructions.md, and their vendor-specific equivalents across Cursor, Windsurf, and other IDE harnesses represents the ecosystem acknowledging that agents need explicit, persistent repository guidance rather than only what they can infer from code. Practitioners are writing these files. Teams are discussing what belongs in them. That is real progress.
What neither layer has fully addressed is the harder question: which context matters and why the repository is in the state it is in.
The publicly accessible practitioner conversation has started moving in that direction. By early 2026, OpenAI’s own engineering team reported that context management had become “one of the biggest challenges in making agents effective at large and complex tasks” (OpenAI Engineering, 2026). Their experience trying to pack all repository guidance into a single large AGENTS.md is instructive: it failed in predictable ways, because volume is not the same as relevance. Their fix was structural, a short AGENTS.md as a table of contents pointing into a well-organized knowledge base, rather than a monolithic artifact. The lesson is not that context does not matter. It is that the wrong context, or too much of it without structure, can be as unhelpful as none.
That insight points toward the harder question this paper is trying to ask. The technical delivery of context to an agent is a solvable engineering problem. The more difficult problem is knowing what context to provide: what the repository needs to communicate about itself, why its current state is what it is, and how that shapes what the agent ought and ought not to do. That layer, the explanatory layer underneath the context window, has not been the center of the conversation. Model capability, benchmark performance, and retrieval mechanics are measurable and demonstrable. Repository history, delivery motivation, and the reasoning behind defensive patterns are not. So they tend to get left out, even as they shape the practical reality of agentic behavior in production systems.
The honest version of this is that the industry has been asking the more tractable questions. This paper is an attempt to ask the harder ones.
Defensive Patterns and Why They Exist
Before addressing what agents need, it is worth understanding the repositories they are entering.
Long-lived product codebases carry patterns that, judged against idealized engineering practice, often look like failures: duplication instead of shared abstraction, refactoring avoidance in fragile areas, minimal change scope to reduce blast radius, inconsistent testing around paths that were once or remain brittle, warning comments serving as institutional memory for decisions nobody wants to revisit. These behaviors appear across industries, across languages, and across team maturity levels. They are common enough to have a name. They are called defensive patterns.
The critical interpretive move is treating those patterns as evidence rather than failure.
Defensive patterns are not primarily a symptom of poor engineering culture. They are adaptations to real operating conditions and delivery motivations. A team that duplicates behavior rather than creating a shared abstraction in a system where breaking changes propagate unpredictably is not being lazy. They are rationally managing blast radius when neither the time nor the information needed to act differently is available to them. A team that avoids refactoring a fragile but critical path is not avoiding hard work. They are managing the asymmetry between the risk of a production incident and the benefit of a cleaner design. A team that leaves a warning comment rather than fixing an underlying issue may have tried to fix it once, paid a cost, and decided the comment was the safer form of institutional knowledge.
This is what delivery motivation means as an analytic concept. The state of a codebase at any point in time reflects the motivations of the organization shaping it. Solving the problem now versus engineering for maintainability, longevity, or a different risk profile leads to different accumulated implementation properties over time. Those properties are not random. They are responses to the incentive landscape the team actually navigates.
The reason this matters for agentic engineering is specific, and it plays out differently depending on who is entering the codebase.
A developer who has spent real time in a codebase absorbs this context intuitively. They know which paths are fragile, which abstractions have been tried and abandoned, which areas carry regulatory sensitivity, which refactors are off-limits not because of technical debt but because of organizational history. That knowledge is rarely written down. It lives in the team’s working memory and, imperfectly, in the defensive patterns themselves.
A developer arriving without that history faces a different version of the same dynamic. They may not know why a pattern exists, but they can read the signals: the warning comments, the conspicuously minimal change history in a critical file, the duplication that looks wrong but that nobody has touched in years. And there is a more immediate force at work: the awareness that they are the one who will be accountable if something breaks. That self-preservation instinct shapes how even an inexperienced developer moves through a codebase. They carry the defensive patterns forward not because they understand them, but because the cost of disturbing them falls on them personally. The signals say “be careful here,” and they are careful.
An agent occupies a different position entirely. It can read the same signals a developer reads: the warning comment, the untouched critical file, the duplication that persists despite being obviously redundant. What it lacks is the interpretive layer that gives those signals their weight. A warning comment is legible text. Its defensive nature, the history behind it, and the cost that was paid when someone ignored something similar are not. Empirical research into comment patterns across large repositories has documented these signals extensively under the label of self-admitted technical debt: comments that developers embed directly in code to flag known issues, deliberate shortcuts, and constraints the team has chosen to preserve rather than resolve. Studies find these patterns present in up to a third of analyzed files, and introduced more frequently by experienced developers than by newcomers, a finding that reframes them as considered choices rather than careless ones (Potdar & Shihab, 2014; Maldonado & Shihab, 2015). An agent does not read a fragile path and feel the pull to leave it alone. It does not carry the accountability that makes a developer pause before changing something they do not fully understand. It has no institutional memory, no empathy for why the pattern exists, and no fear of being the one who breaks something. Without explicit context telling it what those signals mean and why they matter, it will act on what it can read, not on what it cannot.
For the class of repositories this paper describes, namely mature codebases under sustained delivery pressure that carry accumulated defensive patterns agents can read but not interpret, that is not a model failure. It is a context failure.
Context Helpers and What They Are
If defensive patterns are the accumulated reality the repository carries, context helpers are the mechanism for making that reality legible before an agent acts. The term used here, context helpers, is equivalent to what others in the practitioner community call agent READMEs, repository-level context files, or repository instructions. The naming is secondary; the function is the same.
A context helper is a structured artifact that provides a coding agent with explicit, curated information about a repository: information the agent cannot reliably infer from code structure or training data alone. The minimum viable set for most repositories is three artifacts:
README.md establishes orientation: purpose, setup, validation, and contribution basics. Its primary audience is human: the developer who needs to understand what the repository is for, how to get it running, and how to contribute. Agents can and do read it, and a well-maintained README provides useful orientation. But its scope is typically bounded by that human intent. It describes what the application is, how to retrieve and build it, and what the basic contribution expectations are. It does not describe the codebase’s accumulated reality: which areas are fragile, which decisions were pragmatic rather than principled, or what the current state of the system reflects about the organization that shaped it. Surveys of open-source and private production repositories suggest many READMEs may be thin, outdated, or written against an earlier version of the codebase rather than the one the agent is about to change. Presence is not the same as usability, and usability is not the same as sufficiency.
AGENTS.md is the current standard artifact for agent-facing guidance (AGENTS.md Format, 2025), and its intent in practice tends toward instruction: what the agent should do, how tasks should be approached, which patterns to follow, which to avoid. A review of 109 AGENTS.md files drawn from a sampled population of approximately 17,500 active repositories supports this. Roughly 60% of the artifacts in that sample were classified as mixed, carrying at least some context alongside instruction, but the honest reading of that number is less encouraging: the majority of files in the mixed category state constraints without explaining them, and the substantive explanatory content, the kind that tells an agent why a pattern exists rather than simply that it does, concentrates in a small minority of long, hand-authored files. Basic repository purpose, meaning any description of what the repository is for and who it serves, was entirely absent in just over a third of files, including several product-named repositories where the file contained no reference to the product at all. Beyond content quality, adoption itself was near-floor: only 116 of the sampled repositories carried an AGENTS.md file at all, less than 1% of the active estate.
Note: this sample draws from a single private repository estate. It spans multiple regions, languages, and application domains, but it is not a cross-industry or open-source population. The patterns observed are directionally useful; they are not statistically representative of all engineering organizations at large.
Agent-facing instruction of this kind is valuable, as far as it goes. But instruction is not the same as context. An instruction tells the agent what to do in a given situation. Context tells the agent what kind of situation it is in.
Two published studies add empirical texture to what “as far as it goes” actually means in practice. One measured agent performance on the same tasks with and without AGENTS.md present and found that the artifact’s presence was associated with lower median runtime and reduced output token consumption (Lulla et al., 2026). That finding carries an important qualification: the study filtered its corpus to files that met a minimum content bar. Retained files had to cover all three of conventions and best practices, architecture and project structure, and project description. The efficiency gains reflect substantive, curated files, not the general population of any file named AGENTS.md. A second study, evaluating context files across a larger task set under three conditions (no context, LLM-generated context, and developer-written context), found that context files tended to reduce task success rates compared to no context and raised inference cost by over 20%, attributing the degradation to unnecessary requirements that made tasks harder (Gloaguen et al., 2026). The two are not contradictory; they measure different outcomes. But taken together, they establish that what an artifact contains is at least as consequential as whether it exists. A context file that adds noise reduces performance. A context file that reduces noise improves it.
A separate repository reality artifact, which this paper proposes under the working name REPOSITORY.md, would carry what instruction alone cannot: not just where the fragile zones are, but why they are fragile, what was tried before, and what the team has accepted as a rational trade-off rather than an oversight. The scope of this artifact is not limited to defensive patterns. It encompasses any dimension of repository reality that requires interpretive context before an agent acts: regulatory boundaries, compliance constraints, data classification zones, architectural decisions that are not apparent from the code itself, domain-specific operational rules, and organizational constraints that predate the current team. The common thread is caution: context that tells an agent this is an area where generic best practice is insufficient and local knowledge is required.
These three artifacts form a layered context surface. The first establishes what the repository is. The second establishes how to operate in it. The third establishes what the repository actually is, as distinct from what it was designed to be or what it appears to be in isolation.
That last distinction is worth holding onto. A well-designed system and the system that accumulated over ten years of delivery pressure in a regulated environment are not the same thing, even if they share a codebase. Context helpers make that gap explicit. They do not ask the agent to pretend the gap does not exist. They tell the agent what the gap is and how to navigate it. As with any repository configuration, context artifacts carry a trust surface: incorrect or harmful instructions embedded in a context file will be followed in the absence of proper guardrails, which means the same care applied to code review applies to the artifacts that instruct agents.
Context Helpers Are Not Task Scaffolding
This distinction matters and it frequently gets collapsed.
Prompts, skills, workflows, agent instructions, and profiles are task scaffolding. They package recurring work into reusable entry points. They improve consistency and ergonomics. They have real value. But they tell an agent what work to do and how to approach that work in a general sense. They do not tell an agent what kind of space that work is happening in.
A well-configured prompt for “add a new API endpoint” does not tell the agent that this particular codebase has a fragile authentication middleware layer that has survived three failed refactoring attempts, or that a specific integration path bypasses standard validation for legacy reasons that the team accepts as a known trade-off. The agent may detect that the middleware is complex or that the integration path is unusual. What it will not know, without being told, is that the complexity is a defensive boundary rather than an accidental one, and that crossing it carries a cost the code itself does not express.
The category failure in most agentic engineering setups is not missing prompts. It is missing context. This paper focuses specifically on pre-execution context artifacts, meaning information made available to an agent before it acts. Complementary directions exist, including agent questioning during task execution and dynamic context discovery at runtime; these are not alternatives but extensions, and are out of scope here.
Confidence Requires Grounding
The word confidence comes up often in discussions of generative AI and coding agents, usually as a property of the model: how confident is the model in its output, how well does it perform on a benchmark, how consistently does it produce correct results across a defined set of tasks.
That framing is useful for model evaluation. It is insufficient for calibrated trust in a deployment context.
An agent can be highly capable and still be operating with the wrong mental model of the repository it is changing. A high benchmark score does not indicate that the agent understood that the file it was editing sits on a regulatory boundary. A high task completion rate does not indicate that the completed tasks respected the constraints the repository requires.
Confidence in a coding agent, as a practical operational concept, is not a property of the model in isolation. It reflects several factors, including model capability, task scope, harness quality, and the maturity of the deployment context. In practice, it depends on the model operating with sufficient local context to distinguish between what is generically correct and what is locally appropriate. That calibration only exists if the local context has been made legible in advance.
This shifts confidence from a vendor claim into an operational construct. It becomes something that accumulates over time, across repeated tasks in a specific context domain, as the agent demonstrates that it can navigate the repository’s reality rather than simply produce competent code against a generic task description. It can be observed in the friction, or absence of friction, that contributions generate at review. It can be tracked as a function of context quality alongside task scope and model version.
A natural counterpoint is that improving model capability will eventually reduce or eliminate this context dependence. That may be true for some dimensions of the gap. Models that reason more deeply, retrieve more selectively, or maintain larger context windows may handle some categories of inference more reliably. But the gap this paper is describing is not primarily an inference problem. It is an information problem. A more capable model still cannot know that a warning comment was written after a production incident, or that a specific path is off-limits because of an organizational decision made three years ago, unless that information exists somewhere in the repository it can read. Capability improvements address what a model can do with the information it has. They do not address what the model cannot know because the information was never written down.
That reframing has a practical implication. Improving confidence in coding agents is not solely a function of upgrading the model. It is also a function of improving the context the model operates within. In many repositories, the highest-leverage improvement available is not a better model. It is a maintained AGENTS.md and a repository reality artifact that someone has actually kept current.
Why This Gets More Consequential Over Time
In a hybrid workflow, a developer is co-present with the agent throughout the task. The session is continuous; the developer observes what the agent is doing, redirects a prompt heading the wrong direction, and catches a problematic suggestion before it lands. Critically, the developer also brings the interpretive layer the agent lacks: the accountability instinct, the institutional memory, the sense of which signals in the codebase mean “be careful” versus “this is just old code.” Missing or thin context helpers in this mode produce suboptimal output. The human-in-the-loop absorbs the gap.
But absorbing the gap has a cost that extends beyond the immediate task. A developer who repeatedly redirects, corrects, or overrides an agent working without sufficient context starts to form an impression: that the agent is ignorant, unreliable, or simply not useful for serious work. That impression is not wrong given the conditions, but it attributes the failure to the agent rather than to the missing context. The agent is not incapable. It is uninformed. One conclusion leads to abandoning the tool. The other leads to fixing the information problem.
That choice becomes considerably less optional as the workflow moves away from the hybrid model.
In a minimally supervised or fully autonomous workflow, that absorption mechanism is gone. An agent assigned a task through a ticket integration reads its instructions, acts across the change set, and surfaces output at the pull request stage. There is no developer present to redirect a misframing mid-task. Every decision the agent makes between assignment and pull request is guided by what it knew before it started. In that environment, context helpers are not a quality enhancer. They are the primary mechanism for grounded operation.
The delivery industry is moving steadily toward the autonomous end of this spectrum. Hybrid workflows are the current dominant mode and will remain common. But autonomous coding agents, assigned tasks from backlogs and producing pull requests without human co-authoring, are already in active use. The gap that is survivable in hybrid operation becomes a prerequisite failure in autonomous operation.
There is a corollary worth naming directly. A repository that demonstrates reduced friction after introducing explicit context helpers is not only showing that its context helpers are working. It is demonstrating the precondition for safe autonomous operation: that an agent can act against that codebase with sufficient grounding to require less human correction. Context readiness and autonomy readiness are not separate concerns. They are the same concern at different points on the adoption curve.
The Practical Position
This paper is not arguing that every repository needs a full agent operating system, an exhaustive documentation suite, or a formal governance framework before any agent touches any code. That would be the wrong conclusion, and it would produce exactly the kind of overhead that most engineering teams cannot realistically sustain. It would also produce bloated context windows: loading an agent with more than it can usefully attend to degrades the very reasoning it was intended to support, working directly against the goal.
The argument is more specific. The minimum viable context layer for a repository is small enough to build and maintain, and consequential enough to materially change how an agent behaves. Three artifacts, maintained with the same care as the codebase they describe, are sufficient to close the gap between generic capability and local reality for most use cases.
This position is predominantly, but not exclusively, centered on the kinds of repositories where the gap is widest: large software estates, long-lived product codebases, and systems operating under sustained delivery pressure in regulated environments. Greenfield projects and small single-purpose repositories may present a narrower version of the same problem. The argument applies to any complex, mature codebase where defensive patterns, organizational constraints, and accumulated history have created a reality that is not visible in the code alone, regardless of organizational scale or sector. The defining characteristic is not the size of the organization. It is the gap between what the codebase looks like and what it means.
The target state is not maximal documentation. It is sufficient context. Sufficient means the agent can distinguish between what is generically correct and what is locally appropriate. Sufficient means the context reflects the current state of the repository, not its intended state from three years ago. Sufficient means the agent knows what it does not know, because the context helper has told it where the fragile boundaries are and what caution looks like in this specific codebase.
That is a different bar from comprehensive, and a more achievable one.
Sustaining a minimal context layer also requires a method for keeping it current. Context that accurately described a repository eighteen months ago may actively mislead an agent today, particularly in active codebases where defensive patterns and fragile boundaries shift with each delivery cycle. An artifact that drifts becomes a liability rather than an asset.
One practical direction here is the development of repository analysis agents, or purpose-built prompts oriented toward the meta-task of context generation and maintenance. These are not the coding agents themselves but analytical counterparts: profiles designed to examine a repository’s structure, history, and observable patterns, then produce or refresh the context artifacts that coding agents depend on. An analytical pass that identifies clusters of warning comments, unusual test isolation, or exception handling concentrated in particular modules can surface the kind of signals that belong in a REPOSITORY.md without requiring a team member to reconstruct them from memory. The same approach can flag when an existing artifact has drifted meaningfully from what the codebase now contains.
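As a rough illustration of one such analytical pass, the sketch below counts warning-style comment markers per directory and reports the directories where they cluster. The marker list, the function names, and the min_hits cutoff are all assumptions made for this example, not part of any established tool, and a marker count is only a prompt for human interpretation, not an interpretation in itself.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical marker vocabulary; a real team would tune this to its own comments.
MARKERS = re.compile(
    r"\b(TODO|FIXME|HACK|XXX|WORKAROUND|DO NOT (?:TOUCH|CHANGE|REMOVE))\b",
    re.IGNORECASE,
)


def count_markers(text: str) -> int:
    """Count the lines in `text` that contain at least one warning marker."""
    return sum(1 for line in text.splitlines() if MARKERS.search(line))


def warning_clusters(files: dict[str, str], min_hits: int = 3) -> list[tuple[str, int]]:
    """Sum marker counts per parent directory and keep directories at or above `min_hits`.

    `files` maps a POSIX-style relative path to that file's text.
    Results are sorted by descending count, then by directory name.
    """
    per_dir: Counter = Counter()
    for path, text in files.items():
        hits = count_markers(text)
        if hits:
            per_dir[str(PurePosixPath(path).parent)] += hits
    clusters = [(d, n) for d, n in per_dir.items() if n >= min_hits]
    return sorted(clusters, key=lambda item: (-item[1], item[0]))
```

The function takes file contents as a plain mapping so that it can be run against any source of text. The directories it reports are candidates for a team member to explain in a REPOSITORY.md, since the count says where caution has accumulated but not why.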
This carries a secondary benefit relevant to the over-indexing risk identified earlier. An agent whose specific task is to produce concise, accurate context artifacts (rather than comprehensive documentation) has a natural forcing function toward sufficiency. Its output is bounded by what a coding agent can usefully attend to, which keeps context generation in productive tension with context consumption. The goal is not to write everything down. It is to write the right things down, in a form that remains readable at inference time. Empirical analysis of context files in production use finds that files under approximately 60 to 100 lines tend to be followed more reliably, and that specific, executable guidance outperforms broader architectural description as a driver of correct agent behavior (Chatlatanagulchai et al., 2025). An analytical agent constrained to produce within those bounds is less likely to generate an artifact that becomes its own source of friction.
The repositories that are hardest for agents to operate in safely are not necessarily the most complex ones. They are often the ones where years of delivery pressure have accumulated context that no one ever wrote down, where defensive patterns are legible to the team but misread by any agent arriving without context about what those patterns mean. For those repositories, a thoughtfully maintained AGENTS.md and a repository reality artifact are not overhead. They are the practical difference between an agent that helps and an agent that generates work.
The conversation around coding agents has been asking “what can they do?” for long enough that the answers have become genuinely impressive. The question that has not received equivalent attention is “what do they need to know before they act?” The answer, in most cases, is less than you might expect, and more than most repositories currently provide.
The gap is not a model problem. It is a context problem. In most cases, teams already have the information needed to close it, not in documentation, but in the codebase itself. Warning comments, concentrated exception handling, duplication that signals blast-radius caution, and areas with conspicuously minimal change history are the signals. What is missing is the step of reading those signals deliberately, interpreting what they mean in the context of this repository’s history and constraints, and writing that interpretation down in a form a coding agent can act on. Recovering lost institutional knowledge is harder when key people have left and the reasoning behind a defensive pattern exists nowhere, but the process of looking for it is the same. Sufficient context means the agent has been told where the careful zones are and why, so that it approaches those areas with the instinct a developer would bring: hesitancy, reduced scope, and a preference for minimal change over a clean refactor it has not earned the right to make.
Validating the Position
The argument in this paper is falsifiable, and that matters. The ETH Zurich finding discussed earlier (Gloaguen et al., 2026) is the standing warning: introducing context files does not automatically improve outcomes. It may produce worse outcomes, at higher cost, if the files add requirements rather than clarity. A position paper that asserts “context helpers improve agentic behavior” without a mechanism for testing that claim would be doing exactly what it criticizes: asserting a state of affairs rather than grounding it in something observable.
The proposed measurement approach separates outcomes into two levels. The first asks whether context helper adoption changes delivery quality at the organizational level. The second asks whether repositories are actually adopting usable, current helper content rather than merely adding files.
At the delivery level, the primary signals are: friction on AI-involved pull requests (meaning any pull request where an AI coding agent contributed to authorship, whether in a hybrid or fully autonomous mode), compliance and control exceptions in AI-involved contributions, and the share of repositories meeting a credible readiness bar for minimally supervised operation.
Delivery friction resolves into two distinct metrics rather than a single speed measure. Churn Rate tracks the proportion of AI-involved pull requests reopened at least once after initial submission. That is a signal of instability or misalignment in the original contribution. Drag Rate tracks the proportion of AI-involved pull requests that receive at least one reviewer-requested change. That is a signal of weak first-pass alignment. Review-to-merge time is tracked as a secondary indicator, interpreted only alongside churn and drag. A pull request that merges quickly because a reviewer gave up is not an improvement.
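The two friction metrics reduce to simple proportions over the set of AI-involved pull requests in a measurement window. The sketch below shows one way they might be computed. The `PullRequest` record and its field names are hypothetical, and the sample data is invented for illustration. How reopen and change-request counts are extracted from a particular hosting platform is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record for one AI-involved pull request."""
    reopen_count: int      # times reopened after initial submission
    change_requests: int   # reviewer-requested changes received


def churn_rate(prs: list[PullRequest]) -> float:
    """Proportion of pull requests reopened at least once."""
    if not prs:
        raise ValueError("no AI-involved pull requests in the window")
    return sum(1 for pr in prs if pr.reopen_count >= 1) / len(prs)


def drag_rate(prs: list[PullRequest]) -> float:
    """Proportion of pull requests with at least one reviewer-requested change."""
    if not prs:
        raise ValueError("no AI-involved pull requests in the window")
    return sum(1 for pr in prs if pr.change_requests >= 1) / len(prs)


# Invented sample window of five AI-involved pull requests.
window = [
    PullRequest(reopen_count=0, change_requests=0),
    PullRequest(reopen_count=1, change_requests=2),
    PullRequest(reopen_count=0, change_requests=1),
    PullRequest(reopen_count=2, change_requests=0),
    PullRequest(reopen_count=0, change_requests=3),
]

print(f"Churn Rate: {churn_rate(window):.2f}")
print(f"Drag Rate:  {drag_rate(window):.2f}")
```

Both functions raise on an empty window rather than returning zero, on the assumption that a window with no AI-involved pull requests should be reported as unmeasured, not as frictionless.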
A Normalized Control Exception Rate tracks compliance, governance, security, and privacy flags per AI-involved pull request, expressed as a rate to allow comparison across repositories with different volumes. Autonomy-Ready Repository Coverage tracks the proportion of in-scope repositories meeting a minimum set of conditions simultaneously: usable helper content that passes a quality and currency rubric, adequate AI-involved pull request volume for credible measurement, delivery friction at or below defined thresholds, and control exceptions at or below a defined threshold. All of these conditions must be sustained across a defined window rather than observed at a single point.
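The readiness bar can be read as a conjunction of conditions that must hold in every period of the window. The sketch below is one possible encoding under that reading. The `PeriodSnapshot` and `Thresholds` types, their field names, and every threshold value are assumptions chosen for illustration, since this paper does not fix specific numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodSnapshot:
    """Hypothetical per-period summary for one repository."""
    helper_usable: bool      # helper content passed the quality and currency rubric
    ai_pr_count: int         # AI-involved pull requests in the period
    churn_rate: float
    drag_rate: float
    control_exceptions: int  # compliance, governance, security, and privacy flags


@dataclass(frozen=True)
class Thresholds:
    """Illustrative thresholds; real values would be set by the organization."""
    min_pr_volume: int
    max_churn: float
    max_drag: float
    max_exception_rate: float


def exception_rate(s: PeriodSnapshot) -> float:
    """Control exceptions per AI-involved pull request."""
    if s.ai_pr_count == 0:
        raise ValueError("rate is undefined with no AI-involved pull requests")
    return s.control_exceptions / s.ai_pr_count


def period_meets_bar(s: PeriodSnapshot, t: Thresholds) -> bool:
    """True when every condition holds for a single period."""
    if s.ai_pr_count == 0 or s.ai_pr_count < t.min_pr_volume:
        return False  # too little volume for credible measurement
    return (
        s.helper_usable
        and s.churn_rate <= t.max_churn
        and s.drag_rate <= t.max_drag
        and exception_rate(s) <= t.max_exception_rate
    )


def is_autonomy_ready(window: list[PeriodSnapshot], t: Thresholds) -> bool:
    """Conditions must be sustained across the window, not met once."""
    return bool(window) and all(period_meets_bar(s, t) for s in window)


def coverage(repos: dict[str, list[PeriodSnapshot]], t: Thresholds) -> float:
    """Proportion of in-scope repositories that are autonomy-ready."""
    if not repos:
        raise ValueError("no in-scope repositories")
    return sum(1 for w in repos.values() if is_autonomy_ready(w, t)) / len(repos)


# Invented example: three repositories, two periods of history where available.
limits = Thresholds(min_pr_volume=20, max_churn=0.10, max_drag=0.40, max_exception_rate=0.05)
good = PeriodSnapshot(True, 40, 0.05, 0.30, 1)
noisy = PeriodSnapshot(True, 40, 0.05, 0.30, 4)   # exception rate 0.10 exceeds the limit
sparse = PeriodSnapshot(True, 3, 0.0, 0.0, 0)     # below minimum volume
repos = {"payments": [good, good], "search": [good, noisy], "tiny": [sparse]}

print(f"Autonomy-Ready Repository Coverage: {coverage(repos, limits):.2f}")
```

In this invented example only one repository of three qualifies: one fails in a single period on control exceptions, and one has too little volume to be judged at all.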
These measures are valid only under conditions that cannot be assumed. Attribution of a pull request as AI-involved must be defined and applied consistently before the measurement window opens. Repositories below a minimum pull request volume in a period do not produce decision-grade metrics; they contribute directional signal only. The comparison design requires a baseline period before helper introduction and a post-introduction period of comparable length to separate intervention effect from background trend. The presence of helper files is explicitly not sufficient: adoption measures must distinguish file existence from usable, current repository context. A helper artifact is considered usable when it is accurate (reflects actual repository behavior, not intended state), repository-specific (not generic guidance that would apply equally to any codebase), and confirmed current within a defined recency window, for example by being reviewed or updated within the last 90 days.
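The usability test and the volume rule can both be stated as small predicates. The sketch below assumes that accuracy and repository specificity are recorded as reviewer judgments, since neither can be derived mechanically, and that only recency is computed. The `HelperReview` type, the function names, and the minimum volume of 20 are hypothetical. The 90-day window follows the example given above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class HelperReview:
    """Hypothetical record of a human review of one helper artifact."""
    accurate: bool              # reflects actual repository behavior
    repository_specific: bool   # not generic guidance
    last_reviewed: date         # most recent review or update


def is_usable(
    review: HelperReview,
    today: date,
    recency_window: timedelta = timedelta(days=90),
) -> bool:
    """Usable means accurate, repository-specific, and confirmed current."""
    age = today - review.last_reviewed
    is_current = timedelta(0) <= age <= recency_window
    return review.accurate and review.repository_specific and is_current


def metric_grade(ai_pr_count: int, min_volume: int) -> str:
    """Low-volume repositories contribute directional signal only."""
    return "decision-grade" if ai_pr_count >= min_volume else "directional"


# Invented example dates.
today = date(2026, 3, 1)
fresh = HelperReview(accurate=True, repository_specific=True, last_reviewed=date(2026, 1, 15))
stale = HelperReview(accurate=True, repository_specific=True, last_reviewed=date(2025, 11, 1))
generic = HelperReview(accurate=True, repository_specific=False, last_reviewed=date(2026, 2, 20))

for name, review in [("fresh", fresh), ("stale", stale), ("generic", generic)]:
    print(f"{name}: usable={is_usable(review, today)}")
print(metric_grade(ai_pr_count=12, min_volume=20))
```

A file that exists but fails any one of the three checks counts as present, not as adopted, which is the distinction the adoption measures are meant to preserve.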
The measurement framework also carries explicit limits. These measures provide indications of impact; they do not prove cause. Helper quality, model capability, harness configuration, and developer practice all influence the outcomes observed. A result showing no improvement or negative movement after helper introduction does not establish that context helpers are ineffective in general. It may mean the helpers introduced were thin, stale, or misaligned with what the repository actually required. That returns to the paper’s central argument about sufficiency rather than presence.
This work is in progress. A defined cohort of repositories is moving through this measurement framework now. The results, as they accumulate, will either support the position this paper takes or complicate it. What they will not change is the nature of the problem. Agents operating without sufficient local context are not failing. They are succeeding against a description of the work that was never complete to begin with.
Appendix: Selected References
The following sources informed the discussion in this paper, particularly the evidence that repository context is an emerging research and practitioner concern, and the observation that its explanatory and organizational dimensions remain underaddressed relative to its technical dimensions.
Tabby AI (2023). “Repository Context for LLMs in Large Codebases.” Industry blog post demonstrating that even capable models produce functionally incorrect code without repository-specific context, including calling wrong function variants. Frames repository context as a critical requirement for correct results in complex projects, not merely a quality enhancement.
Liu et al., ByteDance (December 2024). “ContextModule: Repository-Level Context Integration for Code Completion.” arXiv preprint. Notes that current LLM code completion tools primarily rely on immediate file context and miss valuable repository-level information. Explicitly frames cross-file context integration as a practical challenge that “remains under-explored” in industry and academic settings.
Wu et al., Amazon Science (2024). “Repoformer: Selective Retrieval for Repository-Level Code Completion.” ICML 2024. Quantifies that repository context is crucial in roughly 20% of code completion cases and develops a selective retrieval policy to identify when context is needed versus when local context suffices. Treats context sufficiency as an engineering problem with measurable properties.
Google Cloud AI (February 2024). “Context-Aware Code Generation with Retrieval-Augmented Generation.” Blog post from the Vertex AI Codey team demonstrating that models produce “convincing and coherent” but incorrect output when lacking project-specific knowledge, including hallucinated API calls. Shows that RAG-based injection of codebase context yields correct, standards-compliant output. Frames organizational codebase knowledge as a first-class input to correct generation.
Cane, B. (May 2026). “Your Coding Agent Is Missing One Thing: Architectural Context.” Practitioner blog post. Argues that agents lack the architectural rationale that human developers absorb through discussion and tribal knowledge. Advocates for Architecture Decision Records (ADRs) and similar artifacts as a mechanism for conveying system constraints to agents in written form, on the grounds that “agents only know what you capture.”
Vasilopoulos, A. (2026). “Codified Context.” Research paper. Treats documentation as “load-bearing infrastructure that AI agents depend on to produce correct output.” Observes that single-file context approaches (such as a monolithic AGENTS.md) do not scale to large codebases, and that agents require structured, persistent, and granular context to avoid misaligned changes in complex systems.
OpenAI Engineering (February 2026). “Harness Engineering: Leveraging Codex in an Agent-First World.” Reports that context management became “one of the biggest challenges in making agents effective at large and complex tasks.” Documents the failure mode of over-dense context in a single AGENTS.md and the shift to a structured knowledge base with AGENTS.md serving as a table of contents. Concludes that curated, well-organized project context is a prerequisite for reliable autonomous agent operation.
AGENTS.md Format (2025). “A simple, open format for guiding coding agents.” The format specification defining the AGENTS.md artifact and its intended role as a persistent, repository-level instruction file for coding agents.
Lulla et al. (January 2026). “On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents.” arXiv:2601.20404. Paired same-task/same-repo study across 10 repositories and 124 pull requests, measuring wall-clock execution time and token usage with and without AGENTS.md present. Reports AGENTS.md presence associated with a 28.64% lower median runtime and 16.58% reduced output token consumption, while maintaining comparable task completion rates. The corpus was filtered in two stages before measurement: structurally, to repositories with a single root-level AGENTS.md only (no subdirectory or conflicting files); and by content, retaining only files that contained all three of conventions and best practices, architecture and project structure, and project description, classified by LLM with manual verification. The efficiency gains therefore reflect substantive, curated files rather than any file named AGENTS.md in the wild. Provides a template for measuring operational friction reduction as a function of context artifact presence and quality.
Gloaguen et al., ETH Zurich SRI Lab (February 2026). “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?” arXiv:2602.11988. Evaluated three conditions (no context, LLM-generated context, and developer-written context) across 138 tasks from 12 Python repositories. Found that context files tended to reduce task success rates compared to no context and raised inference cost by over 20%. Attributes the degradation to unnecessary requirements that made tasks harder rather than easier, and recommends that context files describe only minimal requirements. Taken alongside Lulla et al., establishes that content quality and minimality determine whether a context artifact is a net positive, not presence alone.
Chatlatanagulchai et al. (November 2025). “Agent READMEs: An Empirical Study of Context Files for Agentic Coding.” arXiv:2511.12884. Empirical analysis of context files in active use across more than 2,500 repositories. Found that files under approximately 60 to 100 lines are followed more reliably, that the average context file reads at legal-document difficulty levels, and that executable command-oriented content outperforms architectural description. Provides content-quality indicators (presence, commit recency, readability, length, and command density) relevant to evaluating context artifact sufficiency in practice.
Falessi, D. & Kazman, R. (2021). “Worst Smells and Their Worst Reasons.” International Conference on Technical Debt (TechDebt 2021). arXiv:2103.09537. Survey of 71 developers and 27 Apache projects. Frames ordinary code smells as rational economic trade-offs, observing that developers may choose to optimize short-term benefits like time to market over long-term maintainability, and that retaining a smell can be the lower-risk choice. Provides scholarly grounding for treating defensive patterns as adaptive responses to delivery conditions rather than evidence of poor engineering practice.
Tufano et al. (2015/2017). “When and Why Your Code Starts to Smell Bad.” ICSE 2015; extended in IEEE Transactions on Software Engineering, vol. 43, no. 11, 2017. Analysis of 200 Java projects and over 500,000 commits. Finds that most code smells are introduced at file creation, that approximately 80% persist, and that the majority are introduced in the final month before a release by experienced developers performing complex tasks. Empirically locates the delivery-pressure origin of defensive patterns, supporting the framing that these are deadline-driven adaptations rather than accidental accumulations.
Potdar, A. & Shihab, E. (2014). “An Exploratory Study on Self-Admitted Technical Debt.” ICSME 2014. Maldonado, E. & Shihab, E. (2015). “Detecting and Quantifying Different Types of Self-Admitted Technical Debt.” MTD 2015. Potdar and Shihab identified 62 recurring comment patterns that indicate technical debt acknowledged directly in code, finding instances in up to 31% of analyzed files and noting that self-admitted debt is introduced more frequently by experienced developers and persists over time. Maldonado and Shihab classified self-admitted technical debt into five types (design, defect, documentation, requirement, and test), with design debt comprising the majority. These studies operationalize the “warning comments as institutional memory” claim central to the defensive-patterns section of this paper.