Just a few notes to myself.
I might expand on this at a later date.
1. Assume that all the links are broken to begin with.
2. Assume all websites are slow.
3. Assume that you have more bandwidth.
4. When running processes, ensure you have some form of mutex lock, file lock, or socket bind to prevent other instances of the process from running.
5. You will not find bugs in your crawler until you crawl over 50,000 documents, and by then it is too late.
6. Be patient.
Number 4 just came and bit me in the arse tonight. Will not make that mistake again.
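A minimal sketch of the file-lock option from number 4, assuming a POSIX system where Python's `fcntl` module is available. The function name `acquire_single_instance_lock` and the lock path are made up for illustration; a mutex or socket bind would serve the same purpose.

```python
import fcntl


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is already held.

    The lock lasts as long as the returned file object stays open, so the
    caller must keep a reference to it for the lifetime of the process.
    """
    lock_file = open(lock_path, "a")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Another open handle (normally another crawler instance) holds the lock.
        lock_file.close()
        return None
    return lock_file
```

At startup the crawler would call something like `lock = acquire_single_instance_lock("/tmp/crawler.lock")` and exit straight away if it gets `None` back. The operating system drops the lock when the process dies, so a crash does not leave a stale lock behind.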
Published on Thursday 11th February 2010
boyter
You can, but if you make that assumption and the opposite turns out to be true, you bring down their site very quickly.
I will expand on this, I think. Another point to add is don't assume timeouts will save you.
"Eternal life or your money back!"
TeX
Does that make it a trade off?
To do it quickly you need a heap of bandwidth, but with a heap of bandwidth you may DoS the server.
boyter
Yes. I found the easiest way is to follow the secure server strategy: lock down everything and then open up when you need more access. So lock to a single process. Lock to a certain number of URLs per process. Lock to specific times.
Writing a decent crawler is not easy.
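A rough sketch of the "lock down, then open up" idea, assuming a cap on URLs per process and a minimum gap between requests to the same host. The class `CrawlBudget`, its parameters and its defaults are hypothetical, not anything from a real crawler.

```python
import time
from urllib.parse import urlsplit


class CrawlBudget:
    """Caps total URLs per process and spaces out requests to the same host."""

    def __init__(self, max_urls, min_delay_seconds, clock=time.monotonic):
        self.max_urls = max_urls
        self.min_delay_seconds = min_delay_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.fetched = 0
        self.last_hit = {}  # host -> time of the most recent recorded fetch

    def seconds_until_allowed(self, url):
        """Return 0.0 if url may be fetched now, a wait in seconds, or None if the budget is spent."""
        if self.fetched >= self.max_urls:
            return None
        last = self.last_hit.get(urlsplit(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay_seconds - (self.clock() - last))

    def record_fetch(self, url):
        """Count one fetch against the budget and remember when the host was hit."""
        self.fetched += 1
        self.last_hit[urlsplit(url).netloc] = self.clock()
```

Start with something tight, say `CrawlBudget(max_urls=100, min_delay_seconds=5.0)`, and loosen it only once you know the target sites can cope.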
TeX
I would have thought number 3 would have been the opposite?
i.e. assume you have next to zero bandwidth, for speed reasons.