dennisgorelik: 2020-06-13 in my home office (Default)
[personal profile] dennisgorelik
Our web crawler hanged. Again.
It looks like it hangs with the frequency of about 1 hang per 1 million web page downloads.
Does not timeout. Does not crash. Just hangs.
Fortunately, other threads in our PostJobFreeService keep running.
(Until AngleSharp HTML parser would crash the whole PostJobFreeService on some weird HTML page, of course.)

Unfortunately, crawler hang is not reproducible: the same page can be downloaded without any problems on the next attempt. Or just in a browser.
But once in 1M downloads something weird happens: our crawler successfully passes HTTP handshake with the remote web server (so no HTTP connection timeout), but then hangs.

For our crawler we are using standard HttpWebRequest class from .NET framework.
Should we crawl with something else?

Or is it inevitable that web crawler would hang eventually and our watchdog should simply restart corresponding thread?

Discussion in Livejournal: http://dennisgorelik.livejournal.com/124693.html

Date: 2017-01-29 05:22 am (UTC)
juan_gandhi: (Default)
From: [personal profile] juan_gandhi
It's not your crawler, it's their side suddenly deciding you are a bot.

Profile

dennisgorelik: 2020-06-13 in my home office (Default)
Dennis Gorelik

June 2025

S M T W T F S
1234 567
891011121314
15161718192021
22232425262728
2930     

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags

No cut tags
Page generated Jun. 8th, 2025 12:05 am
Powered by Dreamwidth Studios
OSZAR »