Things to keep in mind when running a web crawler

Just a few notes to myself.

I might expand on this at a later date.

1. Assume that all the links are broken to begin with.
2. Assume all websites are slow.
3. Assume that you have more bandwidth than the sites you are crawling.
4. When running processes, ensure you have some form of mutex lock, file lock or socket bind to prevent other instances of the process from running.
5. You will not find bugs in your crawler until you crawl over 50,000 documents, and by then it is too late.
6. Be patient.

Number 4 just came and bit me in the arse tonight. Will not make that mistake again.
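
For number 4, here is a minimal sketch of the socket bind approach, assuming a crawler that runs on one machine. The function name and the port number are made up for illustration; any port nothing else uses would do. The idea is that only one process can hold the port, and the operating system releases it automatically if that process dies.

```python
import socket


def acquire_single_instance_lock(port=47200, host="127.0.0.1"):
    """Try to claim a local TCP port as a 'one crawler only' lock.

    Returns the bound socket on success, or None if another process
    (or another socket) already holds the port. The port number is an
    arbitrary choice for this sketch, not anything standard.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        # Someone else already holds the port, so treat that as "locked".
        sock.close()
        return None
    return sock


if __name__ == "__main__":
    lock = acquire_single_instance_lock()
    if lock is None:
        raise SystemExit("another crawler instance is already running")
    # Keep a reference to `lock` for as long as the crawler runs.
    print("lock acquired, crawling can start")
```

Unlike a lock file, nothing is left behind to clean up after a crash, which is the main reason to prefer it here.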

Published on Thursday 11th February 2010

Comments (4)

TeX
I would have thought number 3 would have been the opposite?
i.e. assume you have next to zero bandwidth, for speed reasons


boyter
You can, but if you make that assumption and the opposite turns out to be true, you bring down their site very quickly.

I will expand on this, I think. Another point to add: don't assume timeouts will save you.

"Eternal life or your money back!"
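
One possible reading of the timeout point (an assumption, the comment does not elaborate): a per-read socket timeout only limits how long a single read may block, so a server that drips out a few bytes at a time can hold a fetch open indefinitely. A rough sketch of an overall deadline follows. `read_with_deadline` and `DeadlineExceeded` are hypothetical names, and `stream` is anything with a `read(n)` method.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a download takes longer than the overall budget."""


def read_with_deadline(stream, max_seconds, chunk_size=8192, clock=time.monotonic):
    """Read `stream` to EOF, giving up once `max_seconds` have elapsed in total.

    The check runs between reads, so a single blocked read can still overrun
    the deadline by up to the stream's own per-read timeout.
    """
    deadline = clock() + max_seconds
    chunks = []
    while True:
        if clock() > deadline:
            raise DeadlineExceeded(f"gave up after {max_seconds}s")
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means EOF
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

With `urllib.request.urlopen(url, timeout=10)` as the stream, the `timeout` would bound each read and `max_seconds` would bound the whole download.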


TeX
Does that make it a trade-off?
To do it quickly you need a heap of bandwidth, but with a heap of bandwidth you may DoS the server.
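
One common way to handle that trade-off is to throttle per host rather than overall, so a fat pipe is spread across many sites instead of aimed at one. A minimal sketch, assuming a fixed minimum delay between hits to the same host; `HostThrottle` and the two second default are illustrative choices, not anything from the post.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self.last_hit = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to fetch `url`, then record the hit."""
        host = urlsplit(url).netloc
        now = self.clock()
        last = self.last_hit.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_hit[host] = now
        return host
```

Call `throttle.wait(url)` right before each fetch. Requests to different hosts do not slow each other down, so total throughput still scales with the number of sites being crawled.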


boyter
Yes. I found the easiest way is to follow the secure-server strategy: lock down everything, then open up when you need more access. So lock to a single process. Lock to a certain number of URLs per process. Lock to specific times.

Writing a decent crawler is not easy :(
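
The "lock down first" idea could be sketched roughly like this: start with a small URL cap and a narrow time window, and loosen them only when needed. `CrawlBudget`, the cap of 1000 and the 01:00 to 05:00 window are all hypothetical values picked for the example.

```python
from datetime import datetime, time as dtime


class CrawlBudget:
    """Cap the number of URLs per process and restrict crawling to a time window."""

    def __init__(self, max_urls=1000, window_start=dtime(1, 0), window_end=dtime(5, 0)):
        self.max_urls = max_urls
        self.window_start = window_start
        self.window_end = window_end
        self.fetched = 0

    def in_window(self, now=None):
        """True if `now` (default: current local time) is inside the window."""
        t = (now or datetime.now()).time()
        if self.window_start <= self.window_end:
            return self.window_start <= t < self.window_end
        # The window wraps past midnight, e.g. 22:00 to 06:00.
        return t >= self.window_start or t < self.window_end

    def may_fetch(self, now=None):
        """True while the URL cap is not used up and the clock is in the window."""
        return self.fetched < self.max_urls and self.in_window(now)

    def record_fetch(self):
        self.fetched += 1
```

A crawl loop would check `budget.may_fetch()` before each request, call `budget.record_fetch()` after it, and exit as soon as the check fails.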

"Eternal life or your money back!"




© Ben Boyter 2007 Powered by Hamster Wheels Ver 3.5