Apply SafeSearch by default across search providers (#763)

#718 reported Deep Research drifting into adult / spam URLs several
rounds into a benign session ("research about https://bhagathgoud.com/
and what he doing currently"). The reporter's log showed Japanese
adult sites being crawled even though the model was emitting normal
queries like "Bhagath Goud LinkedIn" and "site:bhagathgoud.com".

The model wasn't generating those URLs. Every provider call site
constructed its params dict without a SafeSearch parameter, so the
underlying HTTP backend (the duckduckgo-search library / DDG's HTML
endpoint in this case) was free to surface "related search" /
trending / spam recommendations that have nothing to do with the
user's query. Per provider:

- SearXNG: instance-dependent; many self-hosted instances default
  to safesearch=0.
- Brave API: defaults to "off" for new API keys.
- duckduckgo-search lib: defaults to "moderate", which still lets
  related-search recommendations and HTTP-backend fallback URLs
  surface trending non-English spam topics.
- DDG HTML fallback (html.duckduckgo.com): no `kp` param, treated
  as off.
- Google PSE: omitted `safe` is equivalent to off.
- Serper: omitted `safe` proxies to Google with safe off.

Since the bad URLs entered through the provider layer, not the
model, the provider params are the right place to gate this.

Changes:

- src/settings.py: new `search_safesearch` setting with default
  "strict". Documented values are "strict" | "moderate" | "off";
  a few aliases ("on", "high", "0", "1", "2", "disabled", ...) are
  also accepted so a hand-edited config doesn't silently fall
  through to off.
- src/search/providers.py:
  - Add `_get_safesearch_level()` (canonical, normalizing) and
    `_safesearch_for(provider)` (per-provider param translation).
  - Thread the per-provider value into every params dict:
    SearXNG JSON, SearXNG language/engines fallbacks, SearXNG HTML,
    Brave, DDG library, DDG HTML fallback, Google PSE, Serper.
  - Tavily is left untouched — its API has no SafeSearch knob and
    its index already filters explicit content at ingest time.
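
For illustration only, the normalize-then-translate split could look
roughly like the sketch below. It is not the code in this commit:
`normalize_safesearch` and `safesearch_params` are hypothetical names,
the alias table is a guess at the aliases listed above, and two
choices are assumptions: unknown values fall back to "strict", and
"moderate" maps to `safe=active` for Google PSE / Serper because they
have no middle tier. The DDG HTML `kp` value is left out.

```python
from typing import Dict

# Assumed alias table; only the three documented values are certain.
_ALIASES = {
    "strict": "strict", "on": "strict", "high": "strict", "2": "strict",
    "moderate": "moderate", "1": "moderate",
    "off": "off", "disabled": "off", "0": "off",
}


def normalize_safesearch(raw: object) -> str:
    """Map a raw setting value to "strict", "moderate" or "off".

    Unrecognized values fall back to "strict" (an assumption), so a
    typo in a hand-edited config never disables filtering.
    """
    return _ALIASES.get(str(raw).strip().lower(), "strict")


def safesearch_params(provider: str, raw_level: object) -> Dict[str, str]:
    """Return the query params a provider needs for the given level."""
    level = normalize_safesearch(raw_level)
    if provider == "searxng":
        # SearXNG uses numeric levels 0/1/2.
        return {"safesearch": {"off": "0", "moderate": "1", "strict": "2"}[level]}
    if provider == "brave":
        # Brave accepts off/moderate/strict directly.
        return {"safesearch": level}
    if provider == "duckduckgo":
        # The DDG library spells the strict level "on".
        return {"safesearch": "on" if level == "strict" else level}
    if provider in ("google_pse", "serper"):
        # No middle tier: the param is omitted for "off", active otherwise.
        return {} if level == "off" else {"safe": "active"}
    # Providers without a SafeSearch knob (e.g. Tavily) get no extra params.
    return {}
```

A call site would then merge the result into its own params dict,
e.g. `params.update(safesearch_params("brave", level))`.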

Behavior change for existing installs: default is now "strict", so
explicit results get filtered across every supported provider
without any user action. Users who deliberately want unfiltered
results can set `search_safesearch` to "off" in Settings. No new
dependencies, no schema migrations.

Closes #718.
tanmayraut45
2026-06-02 08:04:32 +05:30
committed by GitHub
parent eff762cdd9
commit 1cc2e90ac0
2 changed files with 85 additions and 5 deletions

@@ -55,6 +55,26 @@ DEFAULT_SETTINGS = {
"search_fallback_chain": ["duckduckgo"],
"search_url": "",
"search_result_count": 5,
# SafeSearch level applied to every provider that exposes one.
# "strict" — block adult / explicit results (default; matches what users
# expect from a research tool and avoids unrelated NSFW URLs
# bleeding in via provider "related" / spam recommendations)
# "moderate" — provider-default behavior (filter explicit but allow
# suggestive content)
# "off" — disable filtering entirely (advanced users only)
#
# Providers that honor this setting (translated to each provider's native
# param in src/search/providers.py:_safesearch_for):
# SearXNG safesearch=0/1/2 (JSON API, HTML scrape, news fallback)
# Brave Search safesearch=off/moderate/strict
# DuckDuckGo safesearch=off/moderate/on (library + HTML kp param)
# Google PSE safe=active (omitted for "off"; PSE has no middle tier)
# Serper.dev safe=active (omitted for "off"; proxies Google's `safe`)
# Providers NOT touched: Tavily (no SafeSearch knob; filters at index time)
# and any custom backend reached via search_url — they keep whatever the
# backend itself decides, so operators stay in control of self-hosted /
# niche search instances.
"search_safesearch": "strict",
"brave_api_key": "",
"google_pse_key": "",
"google_pse_cx": "",