Skip to content

fix: prevent plan polling from causing intermittent 500s#86

Open
black-pwq wants to merge 12 commits intoketing:mainfrom
black-pwq:fix/plan-500-error
Open

fix: prevent plan polling from causing intermittent 500s#86
black-pwq wants to merge 12 commits intoketing:mainfrom
black-pwq:fix/plan-500-error

Conversation

@black-pwq
Copy link
Copy Markdown
Contributor

Summary

Fix the intermittent 500 error seen after entering the Plan stage.

This change keeps the background polling path from blocking the event loop and exhausting request threads, which could surface as:

  • GET /api/projects/{id}/plans returning 500
  • RuntimeError: can't start new thread

Changes

  • run project polling through a worker wrapper submitted via run_in_executor
  • avoid reusing the request/session-bound SQLAlchemy state inside the executor worker
  • refresh project state after polling before computing the next schedule
  • update the polling executor test to assert the worker wrapper is dispatched
  • add a regression test covering GET /api/projects/{id}/plans staying responsive during background polling

Verification

  • added regression coverage for the Plan 500 scenario
  • added executor-dispatch coverage for polling loop behavior

Closes #51

black-pwq added 10 commits May 1, 2026 14:51
…om anyio

The test test_ensure_repo_sync_retries_retryable_fetch_failure was
asserting that time.sleep is called exactly once during a retryable
fetch failure, but it was failing with:

  AssertionError: Expected 'sleep' to have been called once.
  Called 32 times.

Root cause
----------
The test patched 'services.git_service.time.sleep', which replaces the
'sleep' attribute on the shared time module object itself.  Because
patch operates on the module-level name binding, this ends up replacing
time.sleep globally for the duration of the test.  The anyio library
(used by pytest-anyio, which is loaded as a plugin) runs an internal
thread pool that polls with exponential back-off using time.sleep
directly.  Those 31 extra calls (0.001, 0.002, 0.004, ... 0.05 × 25)
belong to anyio's scheduler, not to git_service, yet they were captured
by the same mock, causing the assertion to fail.

Fix
---
Introduce a module-level alias in git_service:

    _sleep = time.sleep

and use _sleep(delay) inside _retry_git_operation instead of calling
time.sleep(delay) directly.

The test now patches 'services.git_service._sleep' instead of
'services.git_service.time.sleep'.  This replaces only the name binding
inside the git_service module namespace; time.sleep elsewhere in the
process (including inside anyio) is unaffected.  The mock therefore
captures exactly the one call made by the retry logic, and
assert_called_once() passes reliably.
Replace the manual venv + pip workflow with uv in both the English
and Chinese README:

- Remove: python3.12 -m venv .venv && source .venv/bin/activate
- Remove: pip install -r requirements-dev.txt
- Replace: uvicorn main:app ... → uv run uvicorn main:app ...
- Add a note explaining that uv auto-creates the venv from pyproject.toml
  and that dev deps can be installed with
@black-pwq black-pwq requested a review from keting as a code owner May 6, 2026 12:36
@keting
Copy link
Copy Markdown
Owner

keting commented May 6, 2026

主线修复(polling_loop 用 run_in_executor 调度 poll_project,worker 内开新 session 不复用 scheduler 的)方向正确,回归测试
test_get_plans_remains_responsive_during_background_polling 也实打实复现了 issue #51 的故障路径并验证了修复,赞。

合并前请处理一个 Blocker:

  • test_polling_loop_remains_responsive_while_poll_project_runs 在干净环境下 5/5 稳定失败(uv run python -m pytest
    tests/test_polling_service.py -v)。根因:测试末尾 await asyncio.sleep(0) 让出的时间不足以让 call_soon_threadsafe 排进来的
    side_channel Task 真正跑起来 —— 我加 print 实测,side_channel 实际是在 test method 返回之后才执行,断言时 Event 还是未 set。建议把
    tests/test_polling_service.py:617 的 await asyncio.sleep(0) 替换为 await asyncio.wait_for(event_loop_ran_concurrently.wait(),
    timeout=1.0),明确等待 Event 被 set,超时再失败。修完后请在 CI 标准的 Python 3.12 + uv run python -m pytest tests/ -v
    下复跑确认稳定通过。

另一个非阻塞建议(Minor):

  • get_current_user / list_plans 改成 async 这两行:现在 polling_loop 已经不阻塞 event loop 了,threadpool
    也不会被打爆,所以这两行实际上不再是修 issue 创建项目进入规划(Plan)阶段后偶发500内部服务器错误 #51 的必需路径;而且函数体内仍是同步 SQLAlchemy 调用,相当于把 SQL I/O 直接搬到 event
    loop 上,跟仓库里其它 router endpoint 的同步范式不一致。可以保留(开销小),但建议在 PR 描述里明确这是 belt-and-suspenders
    而非必要修复,避免后人误以为"async 才是正解"扩散到其它 endpoint。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

创建项目进入规划(Plan)阶段后偶发500内部服务器错误

2 participants