#718 reported Deep Research drifting into adult / spam URLs several
rounds into a benign session ("research about https://bhagathgoud.com/
and what he doing currently"). The reporter's log showed Japanese
adult sites being crawled even though the model was emitting normal
queries like "Bhagath Goud LinkedIn" and "site:bhagathgoud.com".
The model wasn't generating those URLs. Every provider call site
constructed its params dict without a SafeSearch parameter, so the
underlying HTTP backend (the duckduckgo-search library / DDG's HTML
endpoint in this case) was free to surface "related search" /
trending / spam recommendations that have nothing to do with the
user's query. Per provider:
- SearXNG: instance-dependent; many self-hosted instances default
  to safesearch=0.
- Brave API: defaults to "off" for new API keys.
- duckduckgo-search lib: defaults to "moderate", which still lets
  related-search recommendations and HTTP-backend fallback URLs
  surface trending non-English spam topics.
- DDG HTML fallback (html.duckduckgo.com): no `kp` param, treated
  as off.
- Google PSE: omitted `safe` is equivalent to off.
- Serper: omitted `safe` proxies to Google with safe off.
Since the bad URLs entered through the provider layer, not the
model, the provider params are the right place to gate this.
Changes:
- src/settings.py: new `search_safesearch` setting with default
  "strict". Documented values ("strict" | "moderate" | "off") plus
  a few aliases ("on", "high", "0/1/2", "disabled", ...) so a
  hand-edited config doesn't silently fall through to off.
- src/search/providers.py:
  - Add `_get_safesearch_level()` (canonical, normalizing) and
    `_safesearch_for(provider)` (per-provider param translation).
  - Thread the per-provider value into every params dict:
    SearXNG JSON, SearXNG language/engines fallbacks, SearXNG HTML,
    Brave, DDG library, DDG HTML fallback, Google PSE, Serper.
  - Tavily is left untouched: its API has no SafeSearch knob and
    its index already filters explicit content at ingest time.
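A rough sketch of the normalize-then-translate shape described above,
not the actual code in src/search/providers.py. The function names
`normalize_safesearch` and `safesearch_for`, the provider keys, the
alias table (including how "0/1/2" map to levels), and the fallback to
"strict" for unknown values are all assumptions made for illustration.
The per-provider values follow each provider's public parameter
conventions as best understood and should be checked against their
docs.

```python
# Hypothetical alias table: every accepted spelling maps to a canonical level.
_ALIASES = {
    "strict": "strict", "on": "strict", "high": "strict", "2": "strict",
    "moderate": "moderate", "1": "moderate",
    "off": "off", "disabled": "off", "0": "off",
}

# Assumed per-provider translations of the canonical level.
_PROVIDER_VALUES = {
    "searxng": {"strict": 2, "moderate": 1, "off": 0},
    "brave": {"strict": "strict", "moderate": "moderate", "off": "off"},
    "ddg_lib": {"strict": "on", "moderate": "moderate", "off": "off"},
    "ddg_html": {"strict": "1", "moderate": "-1", "off": "-2"},
    # Google-backed APIs only expose active/off, so moderate rounds up.
    "google_pse": {"strict": "active", "moderate": "active", "off": "off"},
    "serper": {"strict": "active", "moderate": "active", "off": "off"},
}


def normalize_safesearch(raw, default="strict"):
    """Map a raw settings value to 'strict' | 'moderate' | 'off'.

    Unrecognized or missing values fall back to `default` rather than
    to 'off', so a typo in a hand-edited config stays filtered.
    """
    key = str(raw).strip().lower() if raw is not None else ""
    return _ALIASES.get(key, default)


def safesearch_for(provider, raw):
    """Translate the raw setting into the value a given provider expects."""
    return _PROVIDER_VALUES[provider][normalize_safesearch(raw)]
```

A call site would then merge the result into its params dict, for
example `params["safesearch"] = safesearch_for("searxng", setting)`.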
Behavior change for existing installs: default is now "strict", so
explicit results get filtered across every supported provider
without any user action. Users who deliberately want unfiltered
results can set `search_safesearch` to "off" in Settings. No new
dependencies, no schema migrations.
Closes #718.
The 600s wall-clock cap in research_handler.start_research was too short
for local / edge LLMs to finish a deep-research synthesis — long
extraction passes plus a slow final report routinely blew past 10
minutes and the run was killed with partial results.
Introduce research_run_timeout_seconds (default 1800s = 30 min) in
DEFAULT_SETTINGS and resolve it at start_research entry when the caller
hasn't pinned hard_timeout. Bound the resolved value at [60, 86400] so a
misconfigured settings.json can neither disable the safety net nor
explode into a multi-day hang. Existing call sites in research_routes.py
and chat_routes.py keep working unchanged: they don't pass hard_timeout
and now pick up the new default.
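The resolution and clamping described above could look roughly like
this. It is a sketch only: the helper name `resolve_hard_timeout`, the
plain-dict `settings` argument, and the fallback to the default on an
unparseable value are assumptions, as is leaving a caller-pinned
hard_timeout unclamped.

```python
MIN_TIMEOUT_S = 60          # lower bound from the commit message
MAX_TIMEOUT_S = 86400       # upper bound (24 h) from the commit message
DEFAULT_TIMEOUT_S = 1800    # default for research_run_timeout_seconds


def resolve_hard_timeout(hard_timeout, settings):
    """Return the wall-clock cap in seconds for one research run.

    A caller-pinned value wins as-is (assumption). Otherwise the setting
    is read, coerced to int, and clamped to [MIN_TIMEOUT_S, MAX_TIMEOUT_S].
    """
    if hard_timeout is not None:
        return hard_timeout
    raw = settings.get("research_run_timeout_seconds", DEFAULT_TIMEOUT_S)
    try:
        value = int(raw)
    except (TypeError, ValueError, OverflowError):
        # Unparseable config value: fall back to the default (assumption).
        value = DEFAULT_TIMEOUT_S
    return max(MIN_TIMEOUT_S, min(MAX_TIMEOUT_S, value))
```

With this shape, a settings.json value of 0 resolves to 60 and an
absurdly large value resolves to 86400.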
Closes #595.
* feat(web-fetch): add web_fetch tool to read a specific URL's content
* test(web-fetch): add SSRF coverage and fail closed on empty DNS resolution
Add explicit SSRF regression tests for the web_fetch path covering
loopback, private LAN ranges, link-local/metadata, IPv6 private/local,
redirect-into-private, and unsupported schemes. Harden _public_http_url
to fail closed when a hostname resolves to no addresses.
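For illustration, a fail-closed check of this kind might be shaped as
follows. This is not the real _public_http_url: the names
`is_public_http_url` and `resolve_host`, the injectable `resolver`
parameter, and the use of `ipaddress`'s `is_global` as the public/private
test are assumptions. A real fetch path would also need to re-run the
check on every redirect hop, which this sketch does not show.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_host(host):
    """Return every IP string the host resolves to ([] on lookup failure)."""
    try:
        infos = socket.getaddrinfo(host, None)
    except (socket.gaierror, UnicodeError):
        return []
    return [info[4][0] for info in infos]


def is_public_http_url(url, resolver=resolve_host):
    """True only if url is http(s) and every resolved address is public."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    addresses = resolver(parts.hostname)
    if not addresses:
        # Fail closed: no resolved addresses means nothing was verified.
        return False
    for addr in addresses:
        try:
            # Drop any IPv6 zone id (e.g. "fe80::1%eth0") before parsing.
            ip = ipaddress.ip_address(addr.split("%", 1)[0])
        except ValueError:
            return False
        if not ip.is_global:
            # Loopback, private LAN, link-local/metadata and similar ranges.
            return False
    return True
```

Passing the resolver in makes the empty-resolution and private-range
cases testable without touching the network.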