odysseus

Author	SHA1	Message	Date
Kenny Van de Maele	8bfd79fe8e	chore: deduplicate src/search modules (cache, content, query) into shims (#2506) * chore: dedupe src/search/cache.py into a re-export shim src/search/cache.py was a byte-identical copy of services/search/cache.py. Convert it to a sys.modules alias of the canonical services module (matching src/search/core.py, providers.py, ranking.py) so the two cannot drift, and add an identity assertion to test_search_module_consolidation.py. content.py and query.py are intentionally left as-is: the copies have drifted and services lacks fixes that src has, so they need services reconciled first before they can be shimmed safely. * chore: dedupe src/search content.py and query.py into shims Convert src/search/content.py and query.py to sys.modules aliases of the canonical services/search/* (matching cache.py, core.py, providers.py, ranking.py) so the duplicate copies cannot drift. Repoint the two tests that were coupled to the src-copy internals onto the canonical services surface (behaviour is equivalent): - test_src_search_query_nonstring.py: import services.search.query instead of loading the src file by path. - test_security_regressions.py::test_web_fetch_guard_blocks_redirect_into_private: mock httpx.get (services uses the module-level get, not httpx.Client) and assert on the canonical 'Blocked' message. Drop the now-redundant [src_content, service_content] parametrization in test_search_content_extraction_parity.py and test_search_content_url_guards.py (after the shim both params are the same object); add content/query identity assertions to test_search_module_consolidation.py.	2026-06-04 18:10:55 +02:00
Wes Huber	93b3e108a6	fix: re-export _SPORTS_HINT_RE from search ranking shim (#2273) The compatibility re-export shim at src/search/ranking.py forgot _SPORTS_HINT_RE, so tests importing src.search.ranking raised AttributeError on the [src] parametrize variant. Fixes #1995 Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-06-04 14:24:53 +01:00
Vykos	aaef6b1c49	fix(search): align content URL guards * Stabilize full test collection * Align search content URL guards	2026-06-04 00:34:06 +01:00
pewdiepie-archdaemon	6861c41580	Reapply "Merge branch 'main' of github.com:pewdiepie-archdaemon/odysseus" This reverts commit `cc8fe2f6e3`.	2026-06-03 22:47:00 +09:00
pewdiepie-archdaemon	cc8fe2f6e3	Revert "Merge branch 'main' of github.com:pewdiepie-archdaemon/odysseus" This reverts commit `8161c1253d`, reversing changes made to `8c2705b42a`.	2026-06-03 22:46:19 +09:00
Alexandre Teixeira	a75dd4a231	fix(search): apply recency UTC fix to live ranking module	2026-06-03 12:49:32 +01:00
Afonso Coutinho	b55c970ec5	fix: sports-hint ranking penalty fires on 'transport'/'passport' substrings (#1473) * fix: sports-hint ranking penalty fires on 'transport'/'passport' substrings * Apply word-boundary sports-hint fix to src/search/ranking.py as well	2026-06-03 14:23:52 +09:00
red person	b409b20940	Handle non-string src search queries (#1646)	2026-06-03 14:11:02 +09:00
Afonso Coutinho	f62d6ea3d7	fix: research query misclassifies 'whatsapp'/'however' as questions (#1247) * fix: detect question words as whole words, not prefixes * fix: same question-word prefix bug in the services search copy * test: question-word detection rejects prefix lookalikes	2026-06-03 01:10:06 +09:00
Afonso Coutinho	203c4d83df	fix: search analytics crashes recording when the JSON file predates a counter (#1224) * refactor: single _default_analytics() instead of duplicated default dicts * fix: merge analytics defaults so an old/partial file doesn't KeyError on record * test: analytics load merges defaults; record survives a partial file	2026-06-03 00:26:37 +09:00
lekt8	975fd42e32	fix: rank recency by UTC, not local time (#1116) (#1234) src/search/ranking.py computed result age as `(datetime.now() - dt).days`, where `dt` is parsed from a UTC-style published date with no timezone. Using local `datetime.now()` skewed the age by the host's UTC offset (off-by-up-to-a-day near boundaries), and was a latent crash: once neighbouring code becomes timezone-aware the naive/aware subtraction raises TypeError (the landmine called out in #1116). Recency is now measured against naive UTC. The scoring is also lifted out of the rank_search_results closure into a module-level, time-injectable `recency_score` so it's unit-testable, and `_utcnow_naive()` avoids `datetime.utcnow()` (removed in Python 3.14). Covered by tests/test_search_ranking_recency.py (5 cases); the existing tests/test_search_ranking.py still passes. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 00:18:15 +09:00
Afonso Coutinho	2e2da2aefe	fix: extract_statistics drops large numbers and trailing % signs (#1153) * fix: extract_statistics misses comma-less numbers and drops trailing % * fix: same extract_statistics number/percent bug in services copy * test: extract_statistics captures full numbers and percent signs	2026-06-02 22:35:30 +09:00
Afonso Coutinho	2b2943a7b7	fix: extract_quotes accepts mismatched opening/closing quotes (#1113) * fix: only extract quotes whose closing quote matches the opening one * fix: same mismatched-quote bug in the services search copy * test: extract_quotes requires matching open/close quotes	2026-06-02 22:34:52 +09:00
ghreprimand	c075abce5d	Search: consolidate core and provider implementations Co-authored-by: ghreprimand <203024559+ghreprimand@users.noreply.github.com>	2026-06-02 21:02:26 +09:00
mist	fca8d68aba	Match host, not substring, when resolving DuckDuckGo redirects (#886) _resolve_ddg_redirect (the DuckDuckGo /l/?uddg= redirect resolver used on every HTML-fallback result href) gated on `"duckduckgo.com" in parsed.hostname`. That substring test also matches look-alike hosts like `duckduckgo.com.evil.com` and `notduckduckgo.com`, so a result link on such a host would be silently rewritten to its embedded `uddg` target. Same substring-vs-hostname pitfall fixed for provider detection in `54ecfa3`. Match the host properly: exactly `duckduckgo.com` or a `.duckduckgo.com` subdomain. Genuine redirects (`//duckduckgo.com/l/...`, and relative `/l/...` hrefs resolved against `html.duckduckgo.com`) keep working. The resolver was a closure inside duckduckgo_search; lifted it (plus the new _is_duckduckgo_host helper) to module scope so it can be unit-tested directly. Adds tests/test_ddg_redirect_resolution.py (red on the look-alike case before this change, green after).	2026-06-02 12:25:56 +09:00
Afonso Coutinho	9d8eebfa63	fix: source thumbnails dropped for http-only og:image URLs (#667) * fix: accept http (not just https) og:image URLs for source thumbnails * test: og:image extraction accepts http and skips relative/svg	2026-06-02 11:41:33 +09:00
tanmayraut45	1cc2e90ac0	Apply SafeSearch by default across search providers (#763) #718 reported Deep Research drifting into adult / spam URLs several rounds into a benign session ("research about https://bhagathgoud.com/ and what he doing currently"). The reporter's log showed Japanese adult sites being crawled even though the model was emitting normal queries like "Bhagath Goud LinkedIn" and "site:bhagathgoud.com". The model wasn't generating those URLs. Every provider call site constructed its params dict without a SafeSearch parameter, so the underlying HTTP backend (the duckduckgo-search library / DDG's HTML endpoint in this case) was free to surface "related search" / trending / spam recommendations that have nothing to do with the user's query. Per provider: - SearXNG: instance-dependent; many self-hosted instances default to safesearch=0. - Brave API: defaults to "off" for new API keys. - duckduckgo-search lib: defaults to "moderate", which still lets related-search recommendations and HTTP-backend fallback URLs surface trending non-English spam topics. - DDG HTML fallback (html.duckduckgo.com): no `kp` param, treated as off. - Google PSE: omitted `safe` is equivalent to off. - Serper: omitted `safe` proxies to Google with safe off. Since the bad URLs entered through the provider layer, not the model, the provider params are the right place to gate this. Changes: - src/settings.py: new `search_safesearch` setting with default "strict". Documented values ("strict" | "moderate" | "off") plus a few aliases ("on", "high", "0/1/2", "disabled", ...) so a hand-edited config doesn't silently fall through to off. - src/search/providers.py: - Add `_get_safesearch_level()` (canonical, normalizing) and `_safesearch_for(provider)` (per-provider param translation). - Thread the per-provider value into every params dict: SearXNG JSON, SearXNG language/engines fallbacks, SearXNG HTML, Brave, DDG library, DDG HTML fallback, Google PSE, Serper. - Tavily is left untouched — its API has no SafeSearch knob and its index already filters explicit content at ingest time. Behavior change for existing installs: default is now "strict", so explicit results get filtered across every supported provider without any user action. Users who deliberately want unfiltered results can set `search_safesearch` to "off" in Settings. No new dependencies, no schema migrations. Closes #718.	2026-06-02 11:34:32 +09:00
mist	5ebe9ee67a	Fix invalidate_search_cache using a key that never matches stored entries (#852) invalidate_search_cache(query) built its cache key as generate_cache_key(f"{query}|10|None"), but the write path (searxng_search_results) replaces the caller's default count of 10 with the admin-configured _get_result_count() (default 5) before building the key. So a default search for "X" is cached under "X|5|None", while invalidation looked for "X|10|None" — they never match, and invalidate_search_cache silently failed to remove anything in the default configuration, violating its docstring ("invalidate ... just the given query"). Derive the count from _get_result_count() so invalidation matches the default-search entry the write path actually stores. The same bug (and fix) applies to both the src/search and services/search copies. Note: time-filtered variants (e.g. "X|5|day") still aren't reachable from a query-only signature, since cache keys are opaque SHA-256 hashes with no stored query; clearing those would need a broader cache-index redesign and is out of scope here. Adds tests/test_search_cache_invalidation.py covering the default-count case.	2026-06-02 10:53:33 +09:00
BSG-Walter	c0466274ed	fix: resolve DuckDuckGo redirect URLs in HTML fallback search The DuckDuckGo HTML fallback returns redirect URLs (//duckduckgo.com/l/?uddg=...) instead of actual page URLs. This caused fetch_webpage_content() to reject them instantly because _public_http_url() requires an http/https scheme, making search results unfetchable in deep research mode. Added _resolve_url() to: - Convert protocol-relative URLs to absolute (https:) - Convert path-relative URLs to absolute - Extract the real URL from DuckDuckGo's /l/?uddg= redirect parameters	2026-06-01 19:42:01 -03:00
Afonso Coutinho	9b1acf6612	Fix year extraction in research queries * fix: extract full year in research query entities, not just the century * fix: same year capture-group bug in the services search copy * test: research query extracts the full year	2026-06-01 23:09:41 +09:00
Rifqi Akram	5b1e56407b	Add SSRF-guarded web fetch agent tool * feat(web-fetch): add web_fetch tool to read a specific URL's content * test(web-fetch): add SSRF coverage and fail closed on empty DNS resolution Add explicit SSRF regression tests for the web_fetch path covering loopback, private LAN ranges, link-local/metadata, IPv6 private/local, redirect-into-private, and unsupported schemes. Harden _public_http_url to fail closed when a hostname resolves to no addresses.	2026-06-01 16:57:28 +09:00
pewdiepie-archdaemon	e5c99a5eee	Odysseus v1.0	2026-05-31 23:58:26 +09:00
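
The host-matching rule described in `fca8d68aba` can be sketched roughly as follows. This is an illustrative approximation, not the repository's code: the function names `is_duckduckgo_host` and `resolve_ddg_redirect`, the `base` parameter, and the `/l/` path check are assumptions made for the example. Only the rule itself (exactly `duckduckgo.com` or a `.duckduckgo.com` subdomain, with relative hrefs resolved against `html.duckduckgo.com`) comes from the commit message.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def is_duckduckgo_host(hostname):
    """Return True only for duckduckgo.com or one of its subdomains.

    Hypothetical stand-in for the _is_duckduckgo_host helper named in the
    commit message; the real implementation may differ.
    """
    if not hostname:
        return False
    host = hostname.lower().rstrip(".")
    # A substring test ("duckduckgo.com" in host) would also accept
    # duckduckgo.com.evil.com and notduckduckgo.com, so compare exactly.
    return host == "duckduckgo.com" or host.endswith(".duckduckgo.com")


def resolve_ddg_redirect(href, base="https://html.duckduckgo.com"):
    """Unwrap a /l/?uddg=<target> redirect, but only on a genuine DDG host."""
    # urljoin handles protocol-relative (//host/...) and path-relative hrefs.
    absolute = urljoin(base, href)
    parsed = urlparse(absolute)
    if is_duckduckgo_host(parsed.hostname) and parsed.path.startswith("/l/"):
        # parse_qs percent-decodes the embedded target URL.
        target = parse_qs(parsed.query).get("uddg", [None])[0]
        if target:
            return target
    # Anything else, including look-alike hosts, is returned unchanged.
    return absolute
```

With this sketch, `//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Fa` resolves to the embedded target, while the same path on `duckduckgo.com.evil.com` is left as-is.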

22 Commits