Wednesday, June 3, 2015

.NET Crawler Harvest

Alexander Nyquist ( http://nyqui.st/ ) made a cool crawler component that I used once, and it worked quite well for me:

http://nyqui.st/harvest-a-c-multithreaded-web-crawler
https://github.com/alexandernyquist/Harvest/

In my solution I wrote each page to the filesystem using OnPageDownloaded, and added some code to filter out items that redirect to a login page. Here a decision has to be made between a white-list and a black-list approach. For full-text search purposes, media content can be filtered out too.
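
A minimal sketch of how such a filter could look. This is not my original code and it is not part of Harvest: the class PageFilter, the method ShouldStore, the path fragments and the content types are all hypothetical names and values chosen for illustration. It combines a black-list for login pages with a white-list for content types, and assumes you know the final URL (after redirects) and the Content-Type of the response.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Harvest.
public static class PageFilter
{
    // Black-list: path fragments that suggest a login page (assumed values).
    private static readonly string[] BlockedPathFragments = { "/login", "/signin" };

    // White-list: content types worth keeping for full-text search (assumed values).
    private static readonly HashSet<string> AllowedContentTypes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "text/html", "text/plain" };

    // finalUri is the URL the response actually came from, i.e. after any redirect.
    public static bool ShouldStore(Uri finalUri, string contentType)
    {
        string path = finalUri.AbsolutePath;

        // Skip pages that ended up on a login page.
        bool isLogin = BlockedPathFragments.Any(
            f => path.IndexOf(f, StringComparison.OrdinalIgnoreCase) >= 0);
        if (isLogin)
        {
            return false;
        }

        // Drop parameters such as "; charset=utf-8" before comparing.
        string mediaType = (contentType ?? string.Empty).Split(';')[0].Trim();

        // Images, video and other media are not on the white-list and are skipped.
        return AllowedContentTypes.Contains(mediaType);
    }
}
```

The idea would be to call PageFilter.ShouldStore from the page-downloaded handler before writing the file. A black-list is easier to start with but lets unknown pages through; a white-list is stricter but needs maintenance whenever the site adds new content.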

cheers,
Daniel