Web crawler hangs
Jan. 28th, 2017 01:01 pm
Our web crawler hung. Again.
It seems to hang about once per 1 million web page downloads.
It does not time out. It does not crash. It just hangs.
Fortunately, other threads in our PostJobFreeService keep running.
(Until the AngleSharp HTML parser crashes the whole PostJobFreeService on some weird HTML page, of course.)
Unfortunately, the crawler hang is not reproducible: the same page downloads without any problems on the next attempt. Or just in a browser.
But about once in 1M downloads something weird happens: our crawler successfully passes the HTTP handshake with the remote web server (so there is no HTTP connection timeout), but then hangs.
For our crawler we are using the standard HttpWebRequest class from the .NET Framework.
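One possible explanation, which the post does not confirm, is that the timeouts in play are per-operation. In .NET, HttpWebRequest.Timeout covers GetResponse, while reads from the response stream are governed by ReadWriteTimeout, and a per-read timeout restarts every time a few bytes arrive. A server that trickles data slowly may therefore never trigger it. A language-neutral sketch in Python of a total time budget layered on top of per-read timeouts follows; `read_with_deadline`, `read_chunk`, and `DownloadDeadlineExceeded` are hypothetical names, not part of the crawler or of any library.

```python
import time
from typing import Callable


class DownloadDeadlineExceeded(Exception):
    """Raised when the whole download takes longer than the allowed budget."""


def read_with_deadline(
    read_chunk: Callable[[], bytes],
    deadline_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> bytes:
    """Read chunks until read_chunk() returns b"", enforcing a total budget.

    read_chunk is a hypothetical callable wrapping one stream read. It is
    assumed to have its own per-read timeout, because this loop can only
    check the clock between reads, not during a blocked read.
    """
    started = clock()
    chunks = []
    while True:
        chunk = read_chunk()
        # Check the total elapsed time after every read, however small.
        if clock() - started > deadline_seconds:
            raise DownloadDeadlineExceeded(
                f"download exceeded {deadline_seconds} s"
            )
        if not chunk:
            # An empty chunk signals end of stream.
            return b"".join(chunks)
        chunks.append(chunk)
```

The deadline is only checked between reads, so a single read that blocks forever still needs a per-read timeout or an external watchdog.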
Should we crawl with something else?
Or is it inevitable that a web crawler will hang eventually, so our watchdog should simply restart the corresponding thread?
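The watchdog idea can be sketched in a few lines. This is a minimal illustration in Python, not the PostJobFreeService code: `run_with_watchdog` and `WorkerHung` are hypothetical names, and the sketch assumes that abandoning a hung thread is acceptable, since a blocked thread generally cannot be stopped safely from outside.

```python
import threading
from typing import Any, Callable


class WorkerHung(Exception):
    """Raised when a job does not finish within the watchdog timeout."""


def run_with_watchdog(job: Callable[[], Any], timeout_seconds: float) -> Any:
    """Run job() in a daemon thread and give up on it after timeout_seconds."""
    result: dict = {}

    def target() -> None:
        try:
            result["value"] = job()
        except Exception as exc:
            # Hand the failure back to the calling thread.
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        # The hung thread is abandoned, not killed; the caller moves on.
        raise WorkerHung(f"job still running after {timeout_seconds} s")
    if "error" in result:
        raise result["error"]
    return result["value"]
```

A crawl loop could catch `WorkerHung`, log the URL, and continue with the next page on a fresh worker. The trade-off is that every abandoned thread keeps its socket and memory until the process exits, which may be tolerable at roughly one hang per million downloads.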
Discussion on LiveJournal: http://dennisgorelik.livejournal.com/124693.html