Kiji Privacy Proxy: Secure Open-Source PII Masking for AI

In the rapidly evolving landscape of generative artificial intelligence, the “digital ninja”—the modern developer, data scientist, or security architect—faces a paradoxical challenge. On one hand, the productivity gains offered by Large Language Models (LLMs) like OpenAI’s GPT-4 and Anthropic’s Claude are too significant to ignore. On the other hand, the risk of leaking sensitive information to these cloud-based black boxes has never been higher. As organizations race to integrate AI into their core workflows, the exposure of personally identifiable information (PII) has become the single greatest barrier to widespread adoption. Enter the Kiji Privacy Proxy, a revolutionary open-source utility released by Dataiku on May 1, 2026, designed to act as a sophisticated, local “sanitization layer” for the AI era.
The Kiji Privacy Proxy is not merely a filter; it is a high-performance local gateway that intercepts outbound AI prompts, identifies sensitive data points, and masks them with realistic dummy values before they ever touch the public internet. The LLM still receives the context it needs to generate a high-quality response, but it never sees the actual names, emails, or Social Security numbers of an organization’s clients. When the response comes back, Kiji restores the original values locally within the user’s environment, allowing a seamless interaction that helps organizations meet the stringent requirements of GDPR, CCPA, and HIPAA.
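To make the mask-then-restore round trip concrete, here is a minimal Python sketch. It is not Kiji's code: the `MaskingSession` class, the `DUMMY_POOL` values, and the hand-written entity list are all hypothetical, and the detection step is assumed to have already happened.

```python
# Hypothetical pool of realistic-looking replacement values, keyed by PII type.
DUMMY_POOL = {
    "NAME": ["Alex Morgan", "Sam Rivera"],
    "EMAIL": ["alex.morgan@example.com", "sam.rivera@example.com"],
}


class MaskingSession:
    """Keeps the original <-> dummy mapping for one prompt/response exchange."""

    def __init__(self, pool=DUMMY_POOL):
        # One iterator per label; next() raises StopIteration if a pool runs out.
        self._iters = {label: iter(values) for label, values in pool.items()}
        self.to_dummy = {}
        self.to_original = {}

    def mask(self, text, entities):
        # entities: (original_text, label) pairs, assumed to come from a detector.
        for original, label in entities:
            if original not in self.to_dummy:
                dummy = next(self._iters[label])
                self.to_dummy[original] = dummy
                self.to_original[dummy] = original
            text = text.replace(original, self.to_dummy[original])
        return text

    def restore(self, text):
        # Swap every dummy value in the model's reply back to the real one.
        for dummy, original in self.to_original.items():
            text = text.replace(dummy, original)
        return text


session = MaskingSession()
prompt = "Email John Doe at john.doe@acme.io about the renewal."
entities = [("john.doe@acme.io", "EMAIL"), ("John Doe", "NAME")]

masked = session.mask(prompt, entities)       # what would leave the machine
reply = "Sure, I drafted a note to Alex Morgan."  # stand-in for an LLM reply
restored = session.restore(reply)             # what the user would see
```

The key design point is that the mapping lives only in the local session object, so the real values never need to travel with the prompt.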
The Technical Architecture of Kiji Privacy Proxy
At the heart of the Kiji Privacy Proxy lies a sophisticated machine learning pipeline optimized for speed and privacy. Unlike many cloud-based security solutions that require sending data to yet another third party for “cleaning,” Kiji operates entirely on the local network. This is made possible through the use of a quantized DistilBERT model executed via ONNX Runtime.
The choice of DistilBERT, a smaller, faster, and lighter version of the BERT transformer model, is strategic. By using an INT8-quantized version of the model, Kiji significantly reduces the memory footprint and computational requirements of the detection process with minimal loss of accuracy. ONNX Runtime allows this model to perform inference directly on the user’s CPU with remarkable efficiency. Technical specifications for the Kiji detection engine include:
- Model Type: Multi-task DistilBERT fine-tuned for Named Entity Recognition (NER) and coreference resolution.
- Inference Engine: ONNX Runtime (INT8 quantization).
- Latency: Consistently under 100ms per request, ensuring that the security layer does not become a bottleneck in the user experience.
- Sequence Length: Support for up to 512 tokens per window, allowing for substantial context analysis.
- Language Support: Trained and optimized for six major languages: English, German, French, Spanish, Dutch, and Danish.
This localized approach is fundamental to Kiji’s value proposition. By keeping the “detection” step within the corporate perimeter, Kiji eliminates the meta-risk of using a privacy tool that itself requires an internet connection, thereby closing the loop on data leakage.
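The 512-token window in the specifications above raises a practical question: what happens to a longer prompt? The article does not say how Kiji handles this, so the sketch below shows one common approach, overlapping sliding windows, purely as an illustration. The function name and the 64-token overlap are assumptions.

```python
def split_into_windows(tokens, max_len=512, overlap=64):
    """Split a token list into overlapping windows of at most max_len tokens.

    Returns (start_offset, window_tokens) pairs. The overlap gives an entity
    that straddles a boundary a chance to appear whole in one window.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if not tokens:
        return []

    stride = max_len - overlap
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return windows


# A 1,200-token prompt, represented here by integer token ids.
windows = split_into_windows(list(range(1200)))
starts = [start for start, _ in windows]  # [0, 448, 896]
```

Each window could then be passed to the detector separately, with the start offsets used to map detected spans back to positions in the full prompt.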
Advanced Detection and Coreference Resolution
One of the most impressive feats of the Kiji Privacy Proxy is its ability to handle over 25 distinct PII types. While basic regex-based tools can identify structured data like credit card numbers or IP addresses, they often fail when faced with unstructured text or context-dependent identifiers. Kiji utilizes its transformer-based model to recognize names, locations, and even subtle identifiers that depend on the surrounding sentence structure.
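A quick way to see the limits of pattern matching is to run a regex-only scanner over a sentence that mixes structured and unstructured identifiers. The patterns below are simplified illustrations, not the rules of any particular tool.

```python
import re

# Simplified patterns for two structured PII types.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def regex_scan(text):
    """Return (label, matched_text) pairs for every pattern hit."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


text = (
    "Maria from the Lyon office logged in from 10.0.4.17; "
    "reach her at m.keller@example.org."
)
hits = regex_scan(text)
# The email and IP address are caught, but "Maria" and "Lyon" are not:
# nothing about their shape marks them as a person or a place.
```

Catching the name and the location requires a model that reads the surrounding words, which is the gap a transformer-based NER model is meant to fill.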
Furthermore, Kiji incorporates coreference resolution. In a typical prompt, a user might mention a client name once (“John Doe”) and then refer to him using pronouns (“he,” “him,” “his”) throughout the rest of the text. Standard PII scanners might mask “John Doe” but leave the pronouns, or worse, fail to understand that a subsequent mention of “the patient” refers to the same sensitive entity. Kiji’s model is trained to recognize these clusters, ensuring that every reference to a sensitive entity is consistently masked and subsequently restored. This maintains the conversational integrity of the prompt, allowing the LLM to provide accurate results based on the relationships between entities without knowing who those entities actually are.
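Once mentions are grouped into clusters, consistent masking is mostly bookkeeping. The sketch below assumes a coreference model has already produced the clusters as character spans; the `span_of` helper only stands in for that model output, and the replacement names are hypothetical. Pronouns are left untouched here, which is a policy choice of this sketch rather than a statement about Kiji's behavior.

```python
def span_of(text, mention):
    """Stand-in for model output: (start, end) of the first occurrence."""
    start = text.index(mention)
    return (start, start + len(mention))


def mask_clusters(text, clusters):
    """Replace every span in a cluster with that cluster's single replacement.

    clusters: list of (replacement, [(start, end), ...]) with non-overlapping spans.
    """
    edits = []
    for replacement, spans in clusters:
        for start, end in spans:
            edits.append((start, end, replacement))
    # Apply edits from the end of the string so earlier offsets stay valid.
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = (
    "John Doe missed two payments. Mr. Doe says he will pay Friday. "
    "Jane Roe will call the borrower."
)
clusters = [
    # All three mentions refer to the same person, so they share one dummy.
    ("Alex Morgan", [span_of(text, m) for m in ("John Doe", "Mr. Doe", "the borrower")]),
    ("Sam Rivera", [span_of(text, "Jane Roe")]),
]
masked = mask_clusters(text, clusters)
```

Because every mention in a cluster maps to the same replacement, the LLM can still follow who did what to whom, even though the real names are gone.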
Benchmark Performance and Industry Standards
In terms of efficacy, the Kiji Privacy Proxy has set a new benchmark for open-source privacy tools. At release, Dataiku reported a 94 percent F1 score on industry-standard PII detection datasets. The F1 score is the harmonic mean of precision (avoiding false positives) and recall (avoiding missed PII), so it is high only when both are high. In the context of enterprise security, both halves matter: a tool that misses even one Social Security number is a liability, while a tool that masks non-sensitive words makes the AI’s response nonsensical.
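For readers who want to see how the metric behaves, here is the standard F1 calculation in Python. The counts are made up for illustration and are not Dataiku's evaluation data.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from raw detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Balanced detector: 94 correct hits, 6 false alarms, 6 misses.
balanced = precision_recall_f1(94, 6, 6)

# Cautious detector: few false alarms, but it misses a quarter of the PII.
cautious = precision_recall_f1(90, 10, 30)
```

In the second case precision is 0.90 but recall is only 0.75, and F1 lands at about 0.82. A detector cannot reach a high F1 by excelling at one half of the job while neglecting the other.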
Kiji’s performance is particularly noteworthy given its low-latency profile. In enterprise environments where developers may be sending hundreds of API calls an hour, any delay over 200ms is usually rejected. By staying under 100ms, Kiji integrates into the “hot path” of development without frustrating the end user.
A Multi-Form Factor Guard for Every Workflow
Recognizing that “digital ninjas” work across various environments, Dataiku has released the Kiji Privacy Proxy in three distinct form factors, ensuring that no matter the workflow, privacy is maintained:
- The macOS Desktop Application: Built using Electron, this native app is designed for individual developers and power users. It automatically configures Proxy Auto-Config (PAC) settings for browsers like Safari and Chrome, routing all traffic through local port 8081. This allows users to use web-based LLM interfaces without manual configuration.
- The Standalone Linux Server: For DevOps teams and enterprise-level deployments, Kiji can be run as a lightweight binary or Docker container. By setting standard HTTP_PROXY and HTTPS_PROXY environment variables, entire application stacks can be routed through Kiji, providing a “transparent” privacy layer for automated pipelines.
- The Chrome Extension: For those who primarily interact with AI via web chat interfaces (ChatGPT, Claude.ai, Gemini), the Kiji extension provides inline PII detection. It highlights sensitive data in the text area before the user hits “send,” offering a final manual check alongside its automatic masking capabilities.
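For the Linux server route, the proxy variables can be set for a single process rather than the whole machine. The sketch below uses only the Python standard library. The address `127.0.0.1:8081` is an assumption: the article gives port 8081 for the macOS app, and the server deployment may listen elsewhere.

```python
import os
import urllib.request


def proxy_settings(host="127.0.0.1", port=8081):
    """Build the two standard proxy variables for a local proxy (assumed address)."""
    url = f"http://{host}:{port}"
    return {"HTTP_PROXY": url, "HTTPS_PROXY": url}


def opener_via_proxy(settings):
    """Return a urllib opener that sends http and https requests via the proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": settings["HTTP_PROXY"], "https": settings["HTTPS_PROXY"]}
    )
    return urllib.request.build_opener(handler)


settings = proxy_settings()

# Environment for a child process: everything it inherits, plus the proxy variables.
child_env = {**os.environ, **settings}

# In-process alternative: an opener whose requests go through the proxy.
opener = opener_via_proxy(settings)
```

Passing `child_env` as the `env` argument of `subprocess.run` would route a script through the proxy without touching any other process, provided the HTTP client it uses honors these variables.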
This flexibility addresses the “shadow AI” problem—where employees use unauthorized AI tools because the official versions are too cumbersome. By making the proxy transparent and easy to install, Kiji encourages compliance through ease of use.
The Regulatory Imperative: GDPR, CCPA, and Beyond
The release of the Kiji Privacy Proxy comes at a time of heightened regulatory scrutiny. As of 2026, data protection authorities across Europe and North America have begun issuing significant fines for “data negligence” involving AI prompts. A recent Dataiku/Harris Poll survey of 600 CIOs revealed that 85 percent of organizations have seen AI projects delayed or completely blocked due to gaps in traceability or explainability, with privacy being the leading concern.
Under the GDPR, sending unencrypted or unmasked PII to a third-party processor (like a cloud AI provider) without a specific Data Processing Agreement (DPA) can lead to catastrophic legal consequences. Kiji provides the technical “de-identification” required to stay compliant. Because the data is masked before it leaves the local network, the cloud AI provider never actually “processes” the PII in a legal sense, drastically simplifying the compliance roadmap for enterprise legal teams.
Open Source Governance and the 575 Lab
Dataiku’s decision to release Kiji as an open-source project under the Apache 2.0 license is a move toward radical transparency. Developed by 575 Lab—Dataiku’s specialized open-source office—Kiji is part of a broader ecosystem aimed at making AI more interpretable and secure. By publishing not just the code, but also the trained model and the training dataset on Hugging Face (under DataikuNLP/kiji-pii-model-onnx), Dataiku allows security researchers to audit the tool for their own specific needs.
Hannes Hapke, Director of 575 Lab, emphasized that “Enterprises are embedding AI agents into decisions that influence revenue and safety, yet most lack visibility into how those systems handle raw data. Kiji is about giving that control back to the organization.” The project invites community contributions, allowing the model to evolve as new types of PII emerge and as LLM prompting techniques (like prompt injection) attempt to bypass standard filters.
The Future of Private AI Interaction
As we move deeper into 2026, the era of “blindly” sending data to the cloud is coming to an end. The Kiji Privacy Proxy represents a shift toward Edge Privacy—the idea that the most sensitive parts of our digital interactions should be managed as close to the user as possible. By combining the power of modern transformer models with the efficiency of local inference engines, Kiji proves that we do not have to choose between AI innovation and data sovereignty.
For the digital ninja, Kiji is the ultimate tool in the arsenal. It provides the stealth and protection needed to navigate the high-stakes world of generative AI without leaving a trail of sensitive data behind. Whether you are a solo developer or a CISO managing a global fleet, the Kiji Privacy Proxy is the essential gateway for a secure, AI-driven future.
By effectively “neutralizing” the risk of PII leakage, Kiji doesn’t just protect data—it unlocks the true potential of generative AI for the most regulated and data-sensitive industries on the planet. In the battle for digital privacy, the Kiji Privacy Proxy is a silent, local, and incredibly powerful ally.
Written by
TempMail Ninja
Digital privacy and online security expert. Passionate about creating tools that protect users' identity on the internet.


