Agentic AI in Digital Forensics

Concerns, Safeguards, and Lessons from a Real Investigation

Can agentic AI assist digital forensic investigators? This article examines concerns from peer-reviewed literature, surveys deployment options for sensitive data, and shares practical lessons from using Claude Code in a forensic investigation — including the open-source arti toolkit developed during the project.
Hunting

Published February 25, 2026

Introduction

This article describes the use of agentic AI in a digital forensic investigation. It begins by explaining what agentic AI is and how it works, then discusses known concerns regarding AI in digital forensics drawing on peer-reviewed literature. It surveys deployment options for organizations handling sensitive data, describes the safeguards and methodology that were developed during the investigation, and concludes with practical observations from using AI as a forensic analysis tool.

The investigation used Claude Code as the primary agentic AI tool. The methodology and toolkit developed during the project have been open-sourced as arti, a framework for AI-assisted digital forensic investigations.

What is Agentic AI?

Traditional large language models operate in a request-response pattern: the user provides a prompt, the model generates a response, and the interaction ends. Agentic AI extends this paradigm by giving the model the ability to take actions in the real world, observe the results, and iterate until a task is complete.

An agentic AI system typically has access to file system operations (reading, writing, and editing files), command execution (running shell commands), search capabilities (searching codebases, web content, or databases), and tool invocation (calling external APIs or specialized tools). The key distinction is autonomy: rather than producing a single response, an agentic AI can plan a multi-step approach, execute each step, evaluate the outcome, and adjust its strategy accordingly. This makes agentic AI particularly suited for complex tasks like software development, system administration, and digital forensics.

The ReAct Pattern

Most agentic AI systems, including Claude Code, implement a variation of the ReAct (Reasoning and Acting) pattern [1]. This pattern interleaves reasoning with action in a cycle: the agent receives input (a user request, file contents, or command output), reasons about what to do next (often producing explicit chain-of-thought reasoning), selects and executes a tool (reading a file, running a command, or searching), receives the tool’s output, and repeats until the task is complete. This explicit reasoning before each action provides transparency into the agent’s decision-making process and allows for better error recovery when actions produce unexpected results.

How Claude Works

Claude is a family of large language models developed by Anthropic. Anthropic does not publicly disclose parameter counts, but industry analysis suggests the models are substantial, with unconfirmed estimates for Claude 3 Opus ranging from 137 billion to over 200 billion parameters.

Claude’s training combines two key approaches. Reinforcement Learning from Human Feedback (RLHF) is the standard approach for aligning large language models: human raters evaluate model outputs, a reward model learns to predict human preferences, and the language model is then fine-tuned to maximize this reward signal. While effective, RLHF requires extensive human labeling effort. Constitutional AI (CAI), developed by Anthropic as a more scalable alignment approach [2], has the model critique its own outputs against a set of defined principles. This self-critique process reduces dependence on human labelers, scales more efficiently than pure RLHF, allows explicit encoding of values and guidelines, and enables the model to explain why certain outputs violate principles.

Claude Code is Anthropic’s official CLI tool that gives Claude agentic capabilities. When operating in agentic mode, Claude can read and write files on the local filesystem, execute shell commands, search codebases using patterns and regular expressions, create and manage task lists, and spawn sub-agents for specialized tasks. The tool implements safety measures, including user approval for sensitive operations and sandboxing of command execution. These safety measures were largely disabled during the investigation described here: requiring approval for every operation slows the work to the point of being impractical.

Concerns for Using AI in Digital Forensics

Several peer-reviewed publications address the challenges and risks of integrating AI into forensic investigations. The following subsections identify key concerns from the literature and describe how each was addressed in this project.

Algorithmic Bias and Training Data Quality

AI models trained on biased or incomplete datasets may produce false positives or negatives, leading to incorrect conclusions [3]. Training datasets may lack diversity, potentially encoding racial, gender, or social biases [4].

In this investigation, agentic AI was used to identify forensically interesting data across large evidence volumes. The principal risk is that the AI may have overlooked anomalies that a human investigator would have detected without AI assistance. This risk was mitigated through repeated follow-up questions whenever correlations suggested by the AI appeared inconsistent or incomplete, while accepting that resource limitations constrain any investigation regardless of tooling. The AI was also used to perform automated scans across all IP addresses in the affected networks, a task that would not have been feasible to conduct manually within the available time.

Lack of Explainability and Transparency

“Black box” AI models are difficult to justify in court, where transparency is crucial [3]. The trend toward “glass box” (interpretable) AI is gaining traction because interpretable AI ensures accuracy and fairness in forensic evidence and influences judicial rulings [5].

All findings in this investigation are presented with specific evidence references, and the report includes instructions for verifying each claim against the source artifacts. The chain of custody has been managed throughout the process, and every analytical step is documented and reproducible.

Push-Button Investigators

A significant concern is untrained investigators using AI tools without forensic knowledge, creating “push-button digital forensic investigators” [3]. These investigators upload artifacts and let AI perform all investigative work without understanding the underlying methodologies, leading to potential misinterpretation of results.

The practical experience from this project is that agentic AI is not close to being able to perform a full investigation autonomously. Continuous guidance and detailed prompts are necessary to move the investigation forward. The AI requires an investigator who understands what questions to ask, how to evaluate the answers, and when to redirect the analysis.

Human Verification Required

AI should complement human expertise rather than replace it, as both machine and human errors can influence the final analysis [6]. AI findings must always be verified by qualified forensic analysts who satisfy “first-party knowledge” requirements before attesting to facts in reports or testimony [3].

Not only is human verification needed, but human guidance throughout the entire process. Agentic AI demonstrates strong capability at well-scoped isolated tasks, for instance extracting specific malware from a memory or disk image, or identifying all DNS queries to a particular domain. However, it has not been possible to have the AI independently construct a full attack timeline. Typical usage followed patterns such as: given a compromised host communicating with a known suspicious IP, search for additional suspicious communication involving that IP across the remaining evidence.

Privacy Concerns

Digital forensics tools can access vast amounts of data, often exceeding what is strictly necessary for an investigation [3]. AI can process and correlate data at scale, increasing the risk of exposing irrelevant personal information [3].

The most thorough approach to the privacy concern is to use isolated installations where data never leaves the organization’s infrastructure. It is possible to set up agentic AI using self-hosted open-weight models such as Llama, Qwen, and Kimi K2, but doing so requires substantial hardware investment. The deployment options are discussed in detail in the Deployment Options section. An alternative approach, used in this investigation, is to extract structured data from raw evidence into intermediate data files that can be sanitized of personally identifiable information before the AI processes them, as described in the Investigation Methodology section.

Data Integrity and Chain of Custody

AI tools should operate under strict chain-of-custody protocols to ensure that AI-derived evidence remains admissible in court [6]. Evidence must remain unchanged, uncorrupted, and unmodified [4]. Sometimes data is not appropriately acquired by law enforcement or analyzed to preserve data integrity [4].

Documented and strictly enforced rules are necessary to prevent agentic AI from modifying evidence; the rules developed for this project, including read-only storage and hash verification with timestamped manifests, are described in the Safeguards section. The need for evidence intake routines and hash verification is no different from an investigation performed without AI assistance.

Lack of Industry Standards

There are no standardized AI models or methodologies, making it challenging for forensic investigators to comprehend all functions and models of their tools [3]. The technology, processes, and available resources used during forensic investigations have seldom changed in twenty years, while AI represents the most significant change in this period [7].

This is as much a question of legal system maturity as of technical standardization. Agentic AI tools perform the same tasks that a human investigator performs, but faster and at greater scale. Ensuring that all presented facts are verifiable against source evidence, combined with proper chain of custody, makes the results of AI-assisted investigations comparable to those of traditional investigations. As the technology matures, standards will need to evolve to address AI-specific considerations.

Adversarial Attacks and Anti-Forensics

AI systems can be manipulated through poisoning training data to evade detection [3]. Cybercriminals can use AI to improve their skills, bypass automatic detection, and develop advanced attack techniques including anti-forensics [4]. Steganography and other techniques can be generated using LLMs to hide malicious payloads [4].

The increasing complexity of attacks and the growing use of anti-forensics techniques make it all the more important to conduct large-scale investigations efficiently. Restricting investigators from using AI tools that attackers are already employing would place the defense at an increasing disadvantage.

Robustness Issues with Synthetic Media

AI algorithms may lack robustness for specific presentation attacks, such as deepfake images [4]. Current systems lack robustness because of insufficient training datasets from real cases. Defendants are beginning to claim in trials that multimedia files were AI-generated, raising questions about how to prove attribution of generative AI media [4].

By using agentic AI effectively on tasks where it performs well, investigators can free time for work that demands human expertise, such as analyzing synthetic media and deepfake evidence.

Generative AI Hallucinations

Generative AI algorithms can produce output with requested characteristics but lacking logic or factual basis, a phenomenon known as hallucination [4]. Synthetically generated data should be supervised by human expert analysts to prevent this problem from affecting conclusions.

Agentic AI requires human guidance to produce meaningful results. Hallucinations do occur, and the system occasionally generates plausible-sounding statements that do not correspond to the actual evidence. The system also sometimes deviates from assigned tasks or loses track of earlier instructions. Agentic AI is a capable assistant for isolated, well-defined tasks, but all results must be verified by the investigator against the source artifacts.

Deployment Options

For organizations handling sensitive data such as forensic evidence containing personally identifiable information or legally privileged material, cloud-based AI services may not be acceptable. Several options exist for private deployment.

Enterprise Cloud Options

Commercial platforms offer varying levels of isolation, though none provide true dedicated hardware or on-premise deployment (see Table 1).

Table 1: Enterprise cloud options for agentic AI. ¹Model sizes are unofficial estimates.

| Platform | Isolation | Agentic Tools | Shell | Size¹ |
|---|---|---|---|---|
| AWS Bedrock + Q Developer | Account/session, VPC, FedRAMP | Q Developer, AgentCore | Yes | 13B–405B |
| Claude Enterprise | Cloud-based, HIPAA-ready | Claude Code, Agent SDK | Yes | >175B |
| Azure OpenAI | PTU dedicated capacity, tenant isolation | Responses API, Agent Framework | Yes | ~1.8T |
| Google Vertex AI | Project-level, VPC, FedRAMP High | Agent Builder (ADK, Agent Engine) | No | ~1–3T |

Azure’s Agent Framework and Google’s Agent Builder require engineering effort to build a working agentic system with file access, tool execution, and reasoning loops. AWS offers a ready-to-use option: Amazon Q Developer provides CLI access with shell execution and can use Claude models via Bedrock. All data stays within the customer’s AWS account, making it the closest enterprise equivalent to Claude Code for organizations requiring data isolation.

Table 2 shows per-token pricing for flagship models (prices as of January 2026) [9], [10], [11], [12].

Table 2: API pricing per 1M tokens for flagship models (USD).

| Platform | Model | Input | Output | Cache Read | Cache Create |
|---|---|---|---|---|---|
| Claude Direct | Opus 4.5 | $5 | $25 | $0.50 | $6.25 |
| AWS Bedrock | Claude Opus 4.5 | $5 | $25 | $0.50 | $6.25 |
| Google Vertex | Gemini 3 Pro | $3 | $15 | $0.30 | $3.75 |
| Azure OpenAI | GPT-4.1 | $2 | $8 | $0.20 | $2.50 |

Based on actual usage during the investigation (approximately 96 million tokens per day, cache-heavy), Table 3 shows the monthly costs for different options.

Table 3: Real-world monthly cost comparison for agentic forensic investigation.

| Option | Monthly Cost |
|---|---|
| Claude Opus 4.5 API | ~$2,300 |
| AWS Bedrock (Claude Opus 4.5) | ~$2,300 |
| Google Gemini 3 Pro API | ~$1,400 |
| Azure GPT-4.1 API | ~$930 |
| Claude Max subscription | $200 |

Enterprise API options are approximately five to ten times more expensive than consumer subscriptions for equivalent usage. These rates are likely negotiable for large enterprise buyers with volume commitments.

Azure’s Provisioned Throughput Units option provides the closest approximation to dedicated resources, offering guaranteed capacity and predictable latency. However, all these solutions run on shared cloud infrastructure with logical isolation rather than physical separation. AWS, Azure, and Claude support shell command execution for agentic workflows. Google Vertex AI only supports Python and JavaScript in sandboxed environments. Claude Code offers the most permissive shell access and can also connect through AWS Bedrock or Google Vertex AI, combining Anthropic’s models with those platforms’ infrastructure.

Self-Hosted Open-Weight Models

For forensic work requiring true air-gapped environments where data cannot leave the organization’s infrastructure, self-hosted open-weight models are the only option. Running a model locally requires an inference framework that loads model weights into memory and handles computation. Common frameworks include Ollama (simple setup), vLLM (high-performance production serving), and llama.cpp (CPU-optimized inference).

Comparing models by parameter count alone is misleading, as training data quality, architecture, and fine-tuning significantly affect real-world performance. For agentic work specifically, SWE-bench (which measures real GitHub issue resolution) is more relevant than general benchmarks. On SWE-bench, Claude Opus 4.5 achieves 80.9%, while traditional open models score much lower: Llama 3.1 405B achieves approximately 25% [13], [14].

Kimi K2 changes this picture. Released in 2025 under the Modified MIT License, it achieves 65.8% on SWE-bench with built-in agentic capabilities including shell execution [15]. Its mixture-of-experts architecture has one trillion total parameters but only 32 billion active per token. With 4-bit quantization, Kimi K2 requires seven A100 80GB GPUs (approximately $119,000), substantial but feasible for organizations with security requirements that justify the investment.

For models without built-in agentic capabilities, Table 4 shows frameworks that add shell execution.

Table 4: Agentic frameworks for self-hosted models.

| Framework | Shell | Description |
|---|---|---|
| Open Interpreter | Yes | Full shell access with any LLM |
| mini-SWE-agent | Yes | Bash-only, minimal tooling |
| Aider | Limited | Code editing, /run for tests only |

Table 5 shows hardware requirements and costs for self-hosted deployment.

Table 5: Hardware requirements for self-hosted models.

| Model | Precision | VRAM | A100 80GB | Cost |
|---|---|---|---|---|
| Kimi K2 1T (MoE) | 4-bit | ~500 GB | 7x | ~$119k |
| Llama 3.1 405B | 8-bit | ~405 GB | 6x | ~$102k |
| DeepSeek-Coder-V2 236B | 8-bit | ~236 GB | 3x | ~$51k |
| Llama 3.1 70B | 8-bit | ~70 GB | 1x | ~$17k |
| Qwen2.5-Coder 72B | 8-bit | ~72 GB | 1x | ~$17k |

For air-gapped agentic forensics, Kimi K2 at 4-bit quantization offers the best capability-to-cost ratio with 65.8% SWE-bench performance and built-in shell support. Organizations with smaller budgets can use Llama 3.1 70B (approximately $17,000) with Open Interpreter, accepting reduced agentic performance.

Safeguards for Agentic AI

When an AI agent can execute arbitrary commands and modify files, establishing clear boundaries becomes critical. Without constraints, an agent might modify or delete evidence files, execute destructive commands, access systems outside the investigation scope, or produce outputs that violate legal guidelines.

Claude Code addresses this through a configuration file (CLAUDE.md) read at startup. This file contains project-specific instructions the model must follow, and can exist at multiple levels (user home directory, project root, subdirectories) with more specific rules taking precedence. Rules can prohibit specific actions, define investigation methodology, and specify documentation requirements.
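A hypothetical excerpt illustrates the kind of rules such a file can carry. The paths and wording below are invented for illustration and are not the project's actual configuration:

```markdown
# CLAUDE.md (illustrative excerpt)

## Evidence handling
- NEVER write to, move, or delete anything under /evidence; treat it as read-only.
- All script output goes to the automated/ directory; investigator notes live in manual/.

## Documentation requirements
- Every finding must cite a file path, line number, and UTC timestamp.
- Record every executed command in the append-only investigation log; never edit past entries.
```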

The rules for this investigation were developed iteratively during the exploratory phase. Evidence integrity was protected by storing files on an ext4 filesystem with read-only permissions, since the original exFAT storage lacked proper permission support. Hash verification using both MD5 and SHA256 with timestamped manifests allowed detection of any modifications over time.
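The manifest scheme can be sketched in a few lines of Python. The file layout and manifest format here are illustrative, not the project's exact tooling:

```python
# Sketch of hash-manifest integrity checking: hash every evidence file
# with MD5 and SHA256, record a timestamped manifest, and later re-hash
# to detect modifications.

import hashlib
import json
import time
from pathlib import Path

def hash_file(path):
    """Compute MD5 and SHA256 of a file in one pass."""
    md5, sha = hashlib.md5(), hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
            sha.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha.hexdigest()}

def build_manifest(evidence_dir, manifest_path):
    """Hash all files under evidence_dir and write a timestamped manifest."""
    manifest = {
        "created_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "files": {str(p): hash_file(p)
                  for p in sorted(Path(evidence_dir).rglob("*")) if p.is_file()},
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest

def verify_manifest(manifest_path):
    """Re-hash every listed file; return the paths that no longer match."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [p for p, h in manifest["files"].items() if hash_file(p) != h]
```

Re-running `verify_manifest` at intervals, and keeping each manifest with its creation timestamp, provides the over-time modification detection described above.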

The investigation progressed through three artifact sets. Work began with set 1 alone, then expanded to sets 1 and 2 combined, and concluded with all three sets together. Each expansion introduced new evidence while preserving earlier findings, preventing premature conclusions based on incomplete information. All findings required specific evidence references with file paths, line numbers, and UTC timestamps. An append-only investigation log maintained chain of custody throughout the analysis.

These constraints matter particularly in forensic contexts. Chain of custody requires preventing any modification to evidence files that could compromise legal admissibility. Reproducibility demands documentation of all commands and extraction methods so other investigators can verify findings. Scope control ensures the AI only accesses evidence appropriate for the current phase, preventing cross-contamination of analysis. An append-only audit trail ensures all actions are recorded and cannot be retroactively modified.

Investigation Methodology

Evolution of the Approach

The investigation methodology evolved significantly as practical experience accumulated. The initial approach had Claude directly execute forensic tools for memory analysis, packet parsing, and log analysis. While this produced results, it created a transparency problem: the AI would run commands, interpret results, and present findings, but the intermediate reasoning was difficult to audit or reproduce. This led to the development of a structured toolkit that separates extraction, analysis, and interpretation into distinct layers, each with clear inputs and outputs.

The arti Toolkit

The toolkit that emerged from this iterative refinement has been open-sourced as arti (AI-assisted digital forensics framework). While currently validated with Claude Code, the architecture supports migration to other AI platforms since all procedural guidance exists in markdown instruction files.

The toolkit follows a layered architecture. At the top layer, named command targets combine extraction and analysis steps into repeatable operations. These targets invoke analysis scripts that form the middle layer, performing the actual extraction and analysis work. At the bottom layer, configuration files specify all artifact paths and project settings, keeping the scripts generic and reusable across investigations.
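As a sketch, the top layer could be expressed as plain make targets chaining extraction and analysis scripts. The script names, variables, and config path below are assumptions for illustration, not arti's actual targets:

```make
# Hypothetical named command targets: each combines extraction and
# analysis steps, reading all artifact paths from a central config file.

CONFIG ?= config/case.yml

extract-network:
	python scripts/extract_flows.py --config $(CONFIG)

analyze-network: extract-network
	python scripts/analyze_flows.py --config $(CONFIG) \
	    --out automated/network/flow_report.md

timeline: analyze-network
	python scripts/build_timeline.py --config $(CONFIG)
```

Because the targets declare prerequisites, invoking `make timeline` runs the whole chain in order, and the same operation is repeatable on demand.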

The analysis framework implements modular components for network, memory, and disk forensics. Each module follows the same pattern: an extraction phase that reads binary evidence and produces structured intermediate files, followed by an analysis phase that examines the extracted data and produces reports highlighting potentially significant findings. A status tracking mechanism records which extraction steps have completed, allowing the process to resume without re-extracting previously processed artifacts.
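The status-tracking idea can be sketched as follows; the status file name and step names are illustrative:

```python
# Sketch of resumable extraction: record each completed step in a status
# file so that a re-run skips work that already finished.

import json
from pathlib import Path

STATUS_FILE = Path("automated/extraction_status.json")

def load_status():
    """Read the status file, or return an empty record on first run."""
    return json.loads(STATUS_FILE.read_text()) if STATUS_FILE.exists() else {}

def run_step(name, func):
    """Run an extraction step unless already done. Returns True if it ran."""
    status = load_status()
    if status.get(name) == "done":
        return False                 # already extracted, skip
    func()                           # perform the extraction work
    status[name] = "done"
    STATUS_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATUS_FILE.write_text(json.dumps(status, indent=2))
    return True
```

Marking a step done only after `func()` returns means an interrupted run simply repeats the unfinished step on resume.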

Specialized query tools complement the primary framework: flow summarization, top-talker identification, beaconing detection, MAC address correlation, packet indexing with TLS fingerprints, Windows triage analysis (event log timelines, browser history, program execution, lateral movement indicators), and network flow visualization.
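Beaconing detection, for example, reduces to spotting near-constant connection intervals. The following is a simplified sketch; the thresholds are illustrative, and real tooling would also handle jitter and gaps:

```python
# Simplified beaconing check: flag a destination whose connection
# inter-arrival times have a very low coefficient of variation.

from statistics import mean, pstdev

def is_beaconing(timestamps, max_cv=0.1, min_connections=5):
    """timestamps: sorted connection times (seconds) to one destination."""
    if len(timestamps) < min_connections:
        return False
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    m = mean(deltas)
    if m == 0:
        return False
    return pstdev(deltas) / m <= max_cv   # near-constant interval => beacon
```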

All artifact paths are read from a central configuration file rather than hardcoded in scripts, ensuring that the same scripts work across different investigations by changing only the configuration.

Tier-Based Evidence Isolation

A critical refinement was the introduction of tier-based evidence isolation. The investigation progressed through three artifact sets representing different perspectives: victim endpoint data, network traffic, and adversary captures. Each tier receives its own extraction output directory and its own packet database. Scripts operating on a given tier read only that tier’s artifacts by default, preventing accidental inclusion of evidence from other tiers in the analysis.

This isolation was introduced after discovering that the original cumulative design, where higher tiers automatically included all lower-tier data, caused subtle contamination. Correcting the extraction to use per-tier paths exclusively resolved the issue and ensured that each tier’s analysis reflected only its own evidence.

Cross-tier correlation, when needed, is performed through dedicated commands that explicitly merge data from multiple tiers. A MAC address correlation tool, for instance, queries all tier databases to track devices across evidence sets. This deliberate opt-in approach to cross-tier analysis ensures the investigator is always aware when evidence boundaries are being crossed.
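The opt-in lookup can be sketched as a query against each tier's own database in turn, so crossing evidence boundaries stays a deliberate, visible step. The database paths and schema below are illustrative:

```python
# Sketch of explicit cross-tier MAC correlation: one MAC is looked up in
# every tier database and the hits are merged into a single ordered list.

import sqlite3

def correlate_mac(mac, tier_dbs):
    """Return (tier_db, ts_utc, src_ip) hits for one MAC across all tiers."""
    hits = []
    for db in tier_dbs:
        con = sqlite3.connect(db)
        for ts_utc, src_ip in con.execute(
                "SELECT ts_utc, src_ip FROM frames "
                "WHERE src_mac = ? ORDER BY ts_utc", (mac,)):
            hits.append((db, ts_utc, src_ip))
        con.close()
    return sorted(hits, key=lambda hit: hit[1])   # chronological across tiers
```

Each hit carries its tier database path, so the resulting timeline always shows which evidence set every observation came from.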

Three-Stage Investigation Cycle

Each tier follows a three-stage investigation cycle. In the first stage, automated extraction and analysis scripts process the raw evidence and produce structured outputs. The investigator reviews these outputs and records factual observations in an exploration findings document, deliberately avoiding interpretation at this point. In the second stage, the investigator develops an investigation plan based on the observations, identifying leads to pursue and evidence gaps to address. In the third stage, deep-dive analysis examines specific leads in detail, returning to the original binary artifacts when the extracted data raises questions that the summaries cannot answer. This stage produces a consolidated timeline documenting who did what and when, with every entry traceable to a specific source artifact.

This cycle often repeats within a single tier. Deep-dive analysis frequently uncovers activity that the automated scripts did not anticipate, which in turn informs the development of new extraction scripts and analysis targets. The result is an iterative process where each round of analysis expands the toolkit’s coverage.

Separation of Automated and Manual Work

The output directory structure enforces a clear separation between automated script outputs and manual investigator notes. Script-generated extractions and analyses are stored in an automated directory that the investigator does not edit directly. Manual findings, investigation plans, and consolidated timelines are stored in a separate manual directory. This separation ensures that re-running the automated extraction never overwrites investigator notes, and that the provenance of every file is immediately clear from its location.

All extracted files follow a standardized format with UTC timestamps, enabling consistent timeline construction across evidence types. Network flows, DNS queries, TLS handshakes, Windows event logs, and process listings all use the same timestamp convention, allowing them to be merged into unified timelines.
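With a shared UTC field, merging heterogeneous records into one timeline becomes a plain sort. The field names (`ts_utc`, `event`) below are illustrative:

```python
# Sketch of unified timeline construction: records from any evidence
# type carry the same UTC timestamp field and merge with a single sort.

from datetime import datetime, timezone

def to_epoch(ts):
    """Parse '2024-03-01T12:00:05Z' style UTC timestamps to epoch seconds."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc).timestamp()

def merge_timelines(*sources):
    """Each source is a list of dicts with 'ts_utc' and 'event' keys."""
    merged = [e for src in sources for e in src]
    return sorted(merged, key=lambda e: to_epoch(e["ts_utc"]))
```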

Automated Screening and AI-Assisted Interpretation

The methodology operates as a two-phase workflow. In the first phase, the Python scripts process raw evidence and flag potentially significant artifacts. The scripts encode investigative heuristics explicitly, such as suspicious port numbers, anomalous process relationships, and known persistence mechanisms. The scripts output structured data documenting what was found and why it was flagged. In the second phase, Claude reviews these flagged items, investigates context, filters false positives, and synthesizes findings into coherent narratives.
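A minimal sketch shows the shape of such an explicit heuristic; the port list and threshold are invented examples, not the project's actual rules:

```python
# Sketch of auditable screening: every flag carries the reason that
# triggered it, so a reviewer can see exactly why an item surfaced.

SUSPICIOUS_PORTS = {4444, 6667, 31337}   # example list only

def screen_flow(flow):
    """flow: dict with 'dst_port' and 'bytes_out'. Returns flag reasons."""
    reasons = []
    if flow["dst_port"] in SUSPICIOUS_PORTS:
        reasons.append(f"suspicious destination port {flow['dst_port']}")
    if flow["bytes_out"] > 50_000_000:
        reasons.append("large outbound transfer (possible exfiltration)")
    return reasons
```

An unflagged flow returns an empty list, and every flagged item states its triggering rule in plain text, which is what makes the screening stage reviewable.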

This separation provides auditability, as another investigator can read the Python code to understand exactly what patterns triggered an alert. The process is reproducible: running the same scripts on the same evidence produces identical outputs. The intermediate data files provide a clear audit trail showing what data informed each conclusion.

The approach carries an inherent risk: the scripts encode assumptions about what constitutes suspicious activity, and those assumptions may be incomplete or inappropriate for novel attack techniques. Findings that fall outside the screening heuristics will not be flagged automatically. This risk is mitigated by the deep-dive cycle described above, where the investigator returns to original artifacts when extracted data reveals inconsistencies or gaps. The risk is not eliminated, and maintaining awareness that script-based screening provides coverage rather than completeness remains important.

Data Confidentiality Through Extraction

The extraction step creates an opportunity to address data confidentiality. The intermediate data files can be sanitized of personally identifiable information and other sensitive data before the AI processes them. Because the agentic AI operates on extracted and sanitized files rather than raw evidence, non-enterprise AI services can be used without risking leakage of sensitive information to external providers. The original unredacted evidence remains available locally for deep dives that require full fidelity.
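Sanitization of an intermediate file can be sketched as masking emails and pseudonymizing out-of-scope IPs. The patterns and the in-scope prefix below are illustrative assumptions:

```python
# Sketch of PII sanitization before AI processing: redact email
# addresses and replace out-of-scope IPs with stable pseudonyms.

import re

SCOPE_PREFIX = "10.13."                        # in-scope network (example)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def sanitize_line(line, ip_map):
    """Redact emails; pseudonymize IPs outside the investigation scope.

    ip_map persists across calls so the same external IP always maps to
    the same pseudonym, preserving correlations in the sanitized data.
    """
    line = EMAIL_RE.sub("<email-redacted>", line)
    def mask_ip(match):
        ip = match.group(0)
        if ip.startswith(SCOPE_PREFIX):
            return ip                          # keep in-scope addresses
        return ip_map.setdefault(ip, f"ext-ip-{len(ip_map) + 1}")
    return IP_RE.sub(mask_ip, line)
```

Keeping the `ip_map` (under access control) allows the investigator to reverse a pseudonym during a deep dive on the unredacted evidence.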

Cross-Evidence Correlation

Cross-evidence correlation remained a task performed with AI assistance rather than through scripted automation. The relationships between evidence sources are often subtle and context-dependent: a DNS query becomes significant only when correlated with a suspicious process spawning at the same timestamp, which in turn connects to a registry modification hours later. Claude’s ability to hold multiple evidence threads in context and reason about their relationships proved more practical than attempting to encode all possible correlation patterns in code.

The final methodology thus represents a hybrid approach: deterministic scripts for systematic extraction and screening of known patterns, tier-isolated databases for controlled evidence management, AI assistance for interpretation and cross-correlation, and human oversight for strategic direction and final conclusions.

Summary

This forensic investigation provided an opportunity to evaluate agentic AI in a realistic setting with substantial evidence volumes. The technology proved particularly valuable for processing large datasets. The investigation involved millions of log entries and extensive network captures that would have required significantly more time to analyze manually. When given clear, well-scoped tasks like extracting DNS queries to external domains or correlating timestamps across evidence sources, the system worked efficiently and consistently.

The experience also revealed important limitations. The system functioned as a capable assistant but not as an autonomous investigator. Strategic decisions about which leads to pursue, how to interpret ambiguous evidence, and when a line of inquiry had become unproductive required human judgment. The system occasionally generated plausible-sounding but incorrect statements that had to be verified against the actual evidence. This reinforced the importance of treating AI-generated findings as preliminary observations requiring confirmation rather than established facts.

Agentic AI is best understood as a force multiplier for human investigators rather than a replacement. It accelerates the mechanical aspects of forensic analysis while the human provides the investigative intuition, maintains appropriate skepticism, and takes responsibility for conclusions.

The arti framework captures the methodology and tooling developed during this investigation, providing a starting point for other practitioners exploring AI-assisted digital forensics.

References

[1] S. Yao et al., “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023.
[2] Y. Bai et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022.
[3] J. Hollindhead, “Digital forensics, AI, and concerns: What is and what is not,” 2025.
[4] S. L. Sanna, L. Regano, D. Maiorca, and G. Giacinto, “Improving cybercrime detection and digital forensics investigations with artificial intelligence,” in APWG-EU 2025: Tech Summit and Researchers Forum, Cagliari, Italy, 2025.
[5] B. Garrett and C. Rudin, “Interpretable algorithmic forensics,” Proceedings of the National Academy of Sciences, vol. 120, 2023, doi: 10.1073/pnas.2301842120.
[6] V.-A. Carcale, “The future of artificial intelligence (AI) applications in forensics,” in RAIS Conference Proceedings, Romania: Research Association for Interdisciplinary Studies, 2025, doi: 10.5281/zenodo.15481632.
[7] S. Khan and S. Parkinson, “The role of artificial intelligence in digital forensics: Case studies and future directions,” Digital Forensic Journal, vol. 47, p. 301637, 2023, doi: 10.1016/j.dfj.2023.301637.
[8] C. Kerdvibulvech, “Big data and AI-driven evidence analysis: A global perspective on citation trends, accessibility, and future research in legal applications,” Journal of Big Data, vol. 11, no. 180, 2024, doi: 10.1186/s40537-024-01046-w.
[9] Anthropic, “Claude API pricing.” https://www.anthropic.com/pricing, 2025.
[10] Amazon Web Services, “Amazon Bedrock pricing.” https://aws.amazon.com/bedrock/pricing/, 2025.
[11] Google Cloud, “Vertex AI pricing.” https://cloud.google.com/vertex-ai/pricing, 2025.
[12] Microsoft Azure, “Azure OpenAI service pricing.” https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/, 2025.
[13] Anthropic, “Introducing Claude Opus 4.5.” https://www.anthropic.com/news/claude-opus-4-5, 2025.
[14] Vellum AI, “Claude Opus 4.5 benchmarks (explained).” https://www.vellum.ai/blog/claude-opus-4-5-benchmarks, 2025.
[15] Moonshot AI, “Kimi K2: Open agentic intelligence.” https://github.com/MoonshotAI/Kimi-K2, 2025.