Apply SafeSearch by default across search providers (#763)
#718 reported Deep Research drifting into adult / spam URLs several rounds into a benign session ("research about https://bhagathgoud.com/ and what he doing currently"). The reporter's log showed Japanese adult sites being crawled even though the model was emitting normal queries like "Bhagath Goud LinkedIn" and "site:bhagathgoud.com".

The model wasn't generating those URLs. Every provider call site constructed its params dict without a SafeSearch parameter, so the underlying HTTP backend (the duckduckgo-search library / DDG's HTML endpoint in this case) was free to surface "related search" / trending / spam recommendations that have nothing to do with the user's query.

Per provider:
- SearXNG: instance-dependent; many self-hosted instances default to safesearch=0.
- Brave API: defaults to "off" for new API keys.
- duckduckgo-search lib: defaults to "moderate", which still lets related-search recommendations and HTTP-backend fallback URLs surface trending non-English spam topics.
- DDG HTML fallback (html.duckduckgo.com): no `kp` param, treated as off.
- Google PSE: omitted `safe` is equivalent to off.
- Serper: omitted `safe` proxies to Google with safe off.

Since the bad URLs entered through the provider layer, not the model, the provider params are the right place to gate this.

Changes:
- src/settings.py: new `search_safesearch` setting with default "strict". Documented values ("strict" | "moderate" | "off") plus a few aliases ("on", "high", "0/1/2", "disabled", ...) so a hand-edited config doesn't silently fall through to off.
- src/search/providers.py:
  - Add `_get_safesearch_level()` (canonical, normalizing) and `_safesearch_for(provider)` (per-provider param translation).
  - Thread the per-provider value into every params dict: SearXNG JSON, SearXNG language/engines fallbacks, SearXNG HTML, Brave, DDG library, DDG HTML fallback, Google PSE, Serper.
  - Tavily is left untouched: its API has no SafeSearch knob and its index already filters explicit content at ingest time.

Behavior change for existing installs: default is now "strict", so explicit results get filtered across every supported provider without any user action. Users who deliberately want unfiltered results can set `search_safesearch` to "off" in Settings.

No new dependencies, no schema migrations.

Closes #718.
@@ -55,6 +55,26 @@ DEFAULT_SETTINGS = {
     "search_fallback_chain": ["duckduckgo"],
     "search_url": "",
     "search_result_count": 5,
+    # SafeSearch level applied to every provider that exposes one.
+    #   "strict"   — block adult / explicit results (default; matches what users
+    #                expect from a research tool and avoids unrelated NSFW URLs
+    #                bleeding in via provider "related" / spam recommendations)
+    #   "moderate" — provider-default behavior (filter explicit but allow
+    #                suggestive content)
+    #   "off"      — disable filtering entirely (advanced users only)
+    #
+    # Providers that honor this setting (translated to each provider's native
+    # param in src/search/providers.py:_safesearch_for):
+    #   SearXNG       safesearch=0/1/2 (JSON API, HTML scrape, news fallback)
+    #   Brave Search  safesearch=off/moderate/strict
+    #   DuckDuckGo    safesearch=off/moderate/on (library + HTML kp param)
+    #   Google PSE    safe=active (omitted for "off"; PSE has no middle tier)
+    #   Serper.dev    safe=active (omitted for "off"; proxies Google's `safe`)
+    # Providers NOT touched: Tavily (no SafeSearch knob; filters at index time)
+    # and any custom backend reached via search_url — they keep whatever the
+    # backend itself decides, so operators stay in control of self-hosted /
+    # niche search instances.
+    "search_safesearch": "strict",
     "brave_api_key": "",
     "google_pse_key": "",
     "google_pse_cx": "",
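The src/search/providers.py half of this change is not shown in the hunk above. As a rough illustration only, the normalization and per-provider translation described in the commit message could look something like the sketch below. The names `normalize_level` and `safesearch_params` are hypothetical stand-ins, not the commit's `_get_safesearch_level()` / `_safesearch_for()`. The alias table (including reading "0/1/2" as off/moderate/strict), the fallback to "strict" for unrecognized values, and the choice to send `safe=active` for "moderate" on Google PSE and Serper are assumptions inferred from the settings comment, not confirmed behavior.

```python
# Hypothetical alias table: maps hand-edited values onto the three
# documented levels. The exact aliases in the real commit may differ.
_ALIASES = {
    "strict": "strict", "on": "strict", "high": "strict", "2": "strict",
    "moderate": "moderate", "1": "moderate",
    "off": "off", "disabled": "off", "0": "off",
}

# Per-provider parameter name and value mapping, following the table in
# the settings comment. A level missing from a mapping means "omit the
# parameter" (assumed for Google PSE / Serper when the level is "off").
_PROVIDER_PARAMS = {
    "searxng": ("safesearch", {"off": "0", "moderate": "1", "strict": "2"}),
    "brave": ("safesearch", {"off": "off", "moderate": "moderate", "strict": "strict"}),
    "duckduckgo": ("safesearch", {"off": "off", "moderate": "moderate", "strict": "on"}),
    "google_pse": ("safe", {"moderate": "active", "strict": "active"}),
    "serper": ("safe", {"moderate": "active", "strict": "active"}),
}


def normalize_level(raw: object) -> str:
    """Return "strict", "moderate", or "off" for a raw setting value.

    Unknown values fall back to "strict" (an assumption) so that a typo
    in the config never silently disables filtering.
    """
    return _ALIASES.get(str(raw).strip().lower(), "strict")


def safesearch_params(provider: str, raw_level: object) -> dict:
    """Return the extra query params for one provider, or {} if none apply."""
    entry = _PROVIDER_PARAMS.get(provider)
    if entry is None:
        # Providers without a SafeSearch knob (e.g. Tavily) get nothing.
        return {}
    name, values = entry
    value = values.get(normalize_level(raw_level))
    return {} if value is None else {name: value}
```

A call site would then merge the result into its existing params dict, for example `params.update(safesearch_params("brave", settings["search_safesearch"]))`, so providers without a SafeSearch parameter are left unchanged.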