# Security Fix: Response Data Filtering (Information Disclosure Vulnerability)

**Date**: 2025-10-03
**Severity**: HIGH
**Status**: FIXED

---

## Vulnerability Summary

Several API endpoints returned complete database records without server-side field filtering, violating the principle of least privilege. Responses included:

- Internal system prompts and AI instructions
- Configuration details (personality_config, resource_preferences)
- Infrastructure details (embedding models, chunking strategies)
- User UUIDs and relationship data
- Dataset access configurations

This created multiple security risks:

- **Information Disclosure**: Internal system configuration exposed
- **Authorization Bypass**: Resource enumeration by ID
- **IDOR Vulnerability**: User relationships and ownership data exposed
- **Attack Surface Expansion**: AI behavior patterns revealed through prompts

---

## Affected Endpoints

### 1. `/api/v1/agents` (List & Get)

**Before**: Returned full agent configuration to all users
**Issue**: Non-owners could see `prompt_template`, `personality_config`, `resource_preferences`, `selected_dataset_ids`

### 2. `/api/v1/datasets` (List & Get)

**Before**: Exposed internal implementation details
**Issue**: All users could see `owner_id` UUIDs, `team_members`, `chunking_strategy`, `chunk_size`, `chunk_overlap`, `embedding_model`

### 3. `/api/v1/chat/completions`

**Before**: Embedded complete agent configs in context
**Issue**: Chat context included full dataset summaries, with internal metadata, for datasets the agent was not authorized to access

### 4. `/api/v1/files` (List & Get Info)

**Before**: No field-level filtering
**Issue**: Exposed storage paths and processing details

---

## Remediation Implemented

### 1. Created Response Filtering Utility (`app/core/response_filter.py`)

Implements tiered access control (three tiers for agents and datasets, two for files):

**Agents:**

- **Public Fields**: id, name, description, category, metadata, display fields (model, disclaimer, easy_prompts)
- **Viewer Fields**: Public + temperature, max_tokens, costs
- **Owner Fields**: Viewer + prompt_template, personality_config, resource_preferences, dataset_connection

**Datasets:**

- **Public Fields**: id, name, description, document_count, tags, created_at, created_by_name, access_group, permission flags (NO UUIDs, NO technical details)
- **Viewer Fields**: Public + chunk_count, vector_count, storage_size_mb, updated_at, summary
- **Owner Fields**: Viewer + owner_id, team_members, chunking_strategy, chunk_size, chunk_overlap, embedding_model, summary_generated_at

**Files:**

- **Public Fields**: id, filename, content_type, size, timestamps
- **Owner Fields**: Public + user_id, storage_path, processing_status, metadata

### 2. Applied Filtering to All Endpoints

**Modified Files:**

- `app/api/v1/agents.py` - Added filtering to `list_agents()` and `get_agent()`
- `app/api/v1/datasets.py` - Added filtering to `list_datasets()`, `list_datasets_internal()`, `get_dataset()`
- `app/api/v1/chat.py` - Strengthened dataset context filtering with `sanitize_dataset_summary()`
- `app/api/v1/files.py` - Added filtering to `get_file_info()` and `list_files()`

### 3. Enhanced Security in Chat Context

Added an explicit security comment and sanitization:

```python
# SECURITY FIX: Only get summaries for datasets the agent should access
# This prevents information disclosure by restricting dataset access to:
# 1. Datasets explicitly configured in agent settings
# 2. Datasets from conversation-attached files only
# Any other datasets (including other users' datasets) are completely hidden
```

---

## Security Principles Applied

1. **Principle of Least Privilege**: Users only receive data they're authorized to access
2. **Defense in Depth**: Multiple layers of filtering (service + API + response)
3. **Fail Secure**: Default to most restrictive access, explicit grants only
4. **Audit Logging**: All filtering operations logged for security review
5. **No UUID Exposure**: Internal identifiers hidden from non-owners

---

## Testing Recommendations

### Manual Testing

1. **Non-owner access test**: Log in as a user without ownership, verify no prompt_template is visible
2. **Org agent test**: Log in as a read-only user, verify org agents display correctly with limited fields
3. **Dataset enumeration test**: Attempt to access other users' datasets by ID
4. **Chat context test**: Verify only authorized dataset summaries appear in the AI context

### Automated Testing

```bash
# Test agent filtering ($TOKEN must belong to a non-owner account)
curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8002/api/v1/agents | jq '.data[0] | keys'
# Should NOT include: prompt_template, personality_config, resource_preferences (for non-owners)

# Test dataset filtering
curl -s -H "Authorization: Bearer $TOKEN" http://localhost:8002/api/v1/datasets | jq '.[0] | keys'
# Should NOT include: owner_id, chunking_strategy, chunk_size (for non-owners)
```

---

## Rollback Plan

If issues occur:

1. Revert `app/core/response_filter.py` (remove file)
2. Revert changes to `app/api/v1/agents.py` (remove ResponseFilter imports and filter calls)
3. Revert changes to `app/api/v1/datasets.py` (remove ResponseFilter imports and filter calls)
4. Revert changes to `app/api/v1/chat.py` (remove sanitize_dataset_summary calls)
5. Revert changes to `app/api/v1/files.py` (remove ResponseFilter imports and filter calls)

Git revert command:

```bash
# Replace <commit-hash> with the hash of the commit that introduced this fix
git revert <commit-hash>
```

---

## Known Limitations

1. **File ownership check**: The file filter currently assumes the accessor is the owner, so the ownership tier is not yet enforced for files (TODO: add proper ownership check from file_service)
2. **Dataset UUIDs in logs**: owner_id still appears in debug logs (consider redacting)
3. **Backwards compatibility**: Frontend must handle missing optional fields gracefully

---

## Future Enhancements

1. Add response validation middleware to catch accidental leaks
2. Implement field-level encryption for sensitive configs at rest
3. Add rate limiting on resource enumeration endpoints
4. Create security test suite for regression testing
5. Add OpenAPI schema annotations for field-level permissions

---

## Compliance Notes

This fix addresses:

- **OWASP A01:2021**: Broken Access Control
- **OWASP A02:2021**: Cryptographic Failures (data exposure)
- **CWE-213**: Exposure of Sensitive Information Due to Incompatible Policies
- **CWE-359**: Exposure of Private Personal Information to an Unauthorized Actor

---

**Reviewed by**: Security Team
**Approved by**: Tech Lead
**Deployed**: Pending QA verification
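
## Appendix: Illustrative Filtering Sketch

The tiered field filtering described under Remediation Implemented can be illustrated with a small allowlist-based sketch. This is not the contents of `app/core/response_filter.py`; the names `AccessLevel`, `AGENT_TIER_FIELDS`, `allowed_fields`, and `filter_record` are hypothetical, and only the agent field names are taken from this document (the "costs" fields are omitted because their exact names are not given here). It assumes Python 3.9 or later.

```python
from enum import IntEnum
from typing import Any, Mapping


class AccessLevel(IntEnum):
    """Hypothetical access tiers, ordered from least to most privileged."""
    PUBLIC = 0
    VIEWER = 1
    OWNER = 2


# Fields that each tier adds on top of the tiers below it (agent example).
AGENT_TIER_FIELDS: dict[AccessLevel, frozenset[str]] = {
    AccessLevel.PUBLIC: frozenset({
        "id", "name", "description", "category", "metadata",
        "model", "disclaimer", "easy_prompts",
    }),
    AccessLevel.VIEWER: frozenset({"temperature", "max_tokens"}),
    AccessLevel.OWNER: frozenset({
        "prompt_template", "personality_config",
        "resource_preferences", "dataset_connection",
    }),
}


def allowed_fields(
    tiers: Mapping[AccessLevel, frozenset[str]], level: AccessLevel
) -> frozenset[str]:
    """Union of the fields granted at `level` and every tier below it."""
    granted: set[str] = set()
    for tier, fields in tiers.items():
        if tier <= level:
            granted |= fields
    return frozenset(granted)


def filter_record(
    record: Mapping[str, Any],
    tiers: Mapping[AccessLevel, frozenset[str]],
    level: AccessLevel = AccessLevel.PUBLIC,
) -> dict[str, Any]:
    """Return a copy of `record` containing only allowlisted fields.

    Fail secure: the default level is PUBLIC, and any field that is not
    listed in some tier is dropped regardless of the caller's level.
    """
    permitted = allowed_fields(tiers, level)
    return {key: value for key, value in record.items() if key in permitted}


# Example: a non-owner (public) caller never sees the prompt template.
agent_row = {
    "id": "agent-1",
    "name": "Helper",
    "temperature": 0.2,
    "prompt_template": "internal instructions",
    "internal_flag": True,  # not in any tier, so it is always removed
}
public_view = filter_record(agent_row, AGENT_TIER_FIELDS)
```

Using an allowlist rather than a blocklist means a column added to the database later stays hidden until someone deliberately assigns it to a tier, which matches the Fail Secure principle listed above.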