#1899 OkHttp protocol fails to cancel request after topology.message.timeout.secs #1900

Open
sebastian-nagel wants to merge 1 commit into apache:main from sebastian-nagel:1899-okhttp-timeout

@sebastian-nagel
Contributor

To force cancellation of the request:

  • set the call timeout to topology.message.timeout.secs (if not -1)

Additional changes:

  • set the TrimmedReason to TIME if OkHttp throws an InterruptedIOException
  • log the reason why the response is trimmed
  • add a type parameter to MutableObject instances
  • replace calls to the deprecated method getValue()
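
As an illustration of the first two items, here is a minimal JDK-only sketch of how an InterruptedIOException raised while reading the body could be mapped to a TIME reason and logged. It is not the code in this PR: the class TrimmedReasonSketch, the enum TrimmedReason, the Result holder and the readBody method are hypothetical stand-ins, and the log line only imitates the format shown in the output below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InterruptedIOException;

/** Hypothetical sketch; names are illustrative, not StormCrawler's actual classes. */
final class TrimmedReasonSketch {

    /** Stand-in for the protocol's trimmed-reason enum. */
    enum TrimmedReason { NOT_TRIMMED, LENGTH, TIME }

    /** Bytes read so far plus the reason reading stopped early, if any. */
    static final class Result {
        final byte[] content;
        final TrimmedReason reason;

        Result(byte[] content, TrimmedReason reason) {
            this.content = content;
            this.reason = reason;
        }
    }

    /** Reads at most maxBytes and keeps partial content if a timeout interrupts the read. */
    static Result readBody(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TrimmedReason reason = TrimmedReason.NOT_TRIMMED;
        byte[] buffer = new byte[8192];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                int remaining = maxBytes - out.size();
                if (n > remaining) {
                    // More data than allowed: keep what fits and stop.
                    out.write(buffer, 0, remaining);
                    reason = TrimmedReason.LENGTH;
                    break;
                }
                out.write(buffer, 0, n);
            }
        } catch (InterruptedIOException e) {
            // A timeout fired mid-read: report TIME instead of failing the fetch.
            reason = TrimmedReason.TIME;
        }
        if (reason != TrimmedReason.NOT_TRIMMED) {
            // Log why the response was trimmed.
            System.err.printf("HTTP content trimmed to %d (reason: %s)%n", out.size(), reason);
        }
        return new Result(out.toByteArray(), reason);
    }

    public static void main(String[] args) throws IOException {
        // 20 bytes available, limit of 10: trimmed because of length.
        Result r = readBody(new ByteArrayInputStream(new byte[20]), 10);
        System.out.println(r.content.length + " bytes, reason " + r.reason);
    }
}
```

Other IOExceptions still propagate to the caller in this sketch; only the interrupted read is treated as a trimmed response.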

So far, the solution has only been verified using the main method of the protocol class:

$> java -cp .../stormcrawler-core-3.5.2-SNAPSHOT.jar:... \
     org.apache.stormcrawler.protocol.okhttp.HttpProtocol \
     -f /tmp/crawler-conf-test.yaml http://cbhjhlccfkqdpknyu.org/
...
[main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using protocol versions: [h2, http/1.1]
[main] INFO org.apache.stormcrawler.protocol.okhttp.HttpProtocol - Using connection pool with max. 5 idle connections and 300 sec. connection keep-alive time
[Thread-0] WARN org.apache.stormcrawler.protocol.okhttp.HttpProtocol - HTTP content trimmed to 10 (reason: TIME)
[Thread-0] WARN crawlercommons.robots.SimpleRobotRulesParser - Problem processing robots.txt for http://cbhjhlccfkqdpknyu.org/
[Thread-0] WARN crawlercommons.robots.SimpleRobotRulesParser -   Unknown line in robots.txt file (size 10): DQEPigDriE
[Thread-0] WARN org.apache.stormcrawler.protocol.okhttp.HttpProtocol - HTTP content trimmed to 10 (reason: TIME)
http://cbhjhlccfkqdpknyu.org/
robots allowed: true
robots requests: 1
sitemaps identified: 0
date: Thu, 07 May 2026 14:02:24 GMT
server: nginx/1.21.6
transfer-encoding: chunked
_protocol_versions_: http/1.1
metrics.dns.resolution.msec: 4
http.trimmed.reason: time
keep-alive: timeout=20
_request.headers_: GET / HTTP/1.1
User-Agent: MyTestBot/3.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Accept-Encoding: zstd, br, gzip
Host: cbhjhlccfkqdpknyu.org
Connection: Keep-Alive


http.trimmed: true
_request.time_: 1778162544171
content-type: application/octet-stream
connection: keep-alive
_response.ip_: 216.218.185.162
_response.headers_: HTTP/1.1 200 OK
Server: nginx/1.21.6
Date: Thu, 07 May 2026 14:02:24 GMT
Content-Type: application/octet-stream
Transfer-Encoding: chunked
Connection: keep-alive
Keep-Alive: timeout=20



status code: 200
content length: 10
fetched in : 60002 msec

topology.message.timeout.secs, fixes apache#1899

- set call timeout to topology.message.timeout.secs (if not -1)
- add type parameter to MutableObject
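
The first item of the commit message can be pictured with a small JDK-only sketch, which is an assumption about the shape of the logic rather than the committed code. The class CallTimeoutSketch and the method callTimeoutFor are hypothetical, and treating every negative value (not just -1) as "no timeout" is a choice made for this sketch only.

```java
import java.time.Duration;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; not the code committed in this PR. */
final class CallTimeoutSketch {

    static final String MESSAGE_TIMEOUT_KEY = "topology.message.timeout.secs";

    /**
     * Returns the call timeout to apply, or empty if the setting is missing,
     * not numeric, or negative (-1 meaning "not set").
     */
    static Optional<Duration> callTimeoutFor(Map<String, ?> conf) {
        Object value = conf.get(MESSAGE_TIMEOUT_KEY);
        if (!(value instanceof Number)) {
            return Optional.empty();
        }
        long secs = ((Number) value).longValue();
        if (secs < 0) {
            // Assumption of this sketch: any negative value disables the timeout.
            return Optional.empty();
        }
        return Optional.of(Duration.ofSeconds(secs));
    }

    public static void main(String[] args) {
        Map<String, Object> conf = Map.of(MESSAGE_TIMEOUT_KEY, 60);
        // With OkHttp, the duration could be handed to
        // OkHttpClient.Builder#callTimeout(Duration) when building the client.
        callTimeoutFor(conf)
                .ifPresent(t -> System.out.println("call timeout: " + t.getSeconds() + " s"));
    }
}
```

Returning an Optional keeps the "not configured" case explicit, so the client builder is only touched when a timeout is actually set.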

@rzo1 added this to the 3.6.0 milestone on May 7, 2026