recommend a webcrawler?



Hello, I have had a quick search to try to find a web crawler that will fetch all the data on a page, so my caching proxy can store it for users when they use the system later on.

 

I have seen a few theories and programs out there, but I want to know which one will be best, or should I just write my own in Java? All I want is to feed the crawler a list of URLs and have it scan through those sites; the pages should then be auto-cached, since the server sits in between. Any recommendations of ones people have used before? I don't want to spend hours testing programs, so I thought I would get a few reviews here :p

 

Currently looking at: https://code.google.com/p/crawler4j/
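For reference, the core loop a library like crawler4j automates is simple: take seed URLs, fetch each page, extract its links, and enqueue any URL not yet visited. The sketch below shows that loop in plain Java; the "web" is an in-memory map standing in for real HTTP fetches, so everything here (the sample URLs, the `WEB` map, the regex link extractor) is illustrative, not crawler4j's actual API.

```java
import java.util.*;
import java.util.regex.*;

// Minimal sketch of a seed-list crawl: breadth-first over links,
// visiting each URL at most once. The WEB map stands in for HTTP GETs.
public class CrawlSketch {
    static final Map<String, String> WEB = Map.of(
        "http://a.example/",   "<a href=\"http://a.example/p1\">p1</a>",
        "http://a.example/p1", "<a href=\"http://a.example/\">home</a>"
    );

    // Extract href targets with a simple regex (a real crawler should
    // use a proper HTML parser instead).
    static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    public static Set<String> crawl(List<String> seeds) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(seeds);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) continue;   // skip already-visited URLs
            String body = WEB.get(url);        // stand-in for an HTTP GET
            if (body == null) continue;
            queue.addAll(links(body));         // enqueue discovered links
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("http://a.example/")));
    }
}
```

In a real deployment the fetch step would go through the caching proxy, so every page the crawler touches ends up cached as a side effect.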


To be honest, caching content on a proxy is pretty old-school stuff; most sites are dynamically generated now. So while you might be able to cache some images and other static assets, most content is dynamic, and caching it on your proxy is not going to gain you much.

Also, what are you going to do, cache the whole Internet? Why would you cache something that nobody has requested yet? How do you know anyone will ever go there? You would just be using up bandwidth for nothing.

If you want to cache locally for user number 2 going to domainx.tld, that is great, but I wouldn't grab domainx.tld until someone has actually requested it. Even then, since domainx.tld will most likely change within 10 minutes, or even on the next hit because it's dynamic, caching is not going to get you much.
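The request-triggered approach described above, combined with a short TTL so stale dynamic pages expire, can be sketched like this. The class name, the `fetch` stand-in for the upstream request, and the TTL value are all assumptions for illustration.

```java
import java.util.*;

// Sketch: cache only after a real request, and expire entries after a
// TTL so frequently-changing dynamic pages are refetched.
public class TtlCache {
    record Entry(String body, long storedAt) {}

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;
    private int upstreamFetches = 0;

    TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // "now" is passed in to keep the sketch deterministic; a real proxy
    // would use System.currentTimeMillis().
    String get(String url, long now) {
        Entry e = cache.get(url);
        if (e != null && now - e.storedAt < ttlMillis) {
            return e.body;                    // hit: user #2 gets the cached copy
        }
        upstreamFetches++;                    // miss or stale: go upstream
        String body = fetch(url);
        cache.put(url, new Entry(body, now));
        return body;
    }

    int upstreamFetches() { return upstreamFetches; }

    // Hypothetical stand-in for the proxy's upstream HTTP request.
    private String fetch(String url) { return "<html>body of " + url + "</html>"; }
}
```

With a 10-minute TTL, two requests a second apart cost one upstream fetch, while a request after the TTL triggers a refetch, which is exactly the trade-off being argued about here.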


 

This specific product needs caching, and trust me, caching does help even on dynamic pages. The user group is very specific and the pages to be accessed will also be specific: sites are blocked, media is blocked. Dynamic content like news websites might change, but it does not change that often, and anywhere I can shave off a few seconds is worth it. The pages are requested before use (before anyone uses the connection), just to save those few minutes at the start, even if they are re-cached later because they are dynamic.

I have done so much research into the viability of this, and it is 100% needed :)
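The pre-warming step described above (requesting every URL on the known list once before users connect) reduces to a simple loop. This is a sketch under stated assumptions: the cache is a plain map and `fetch` is a hypothetical stand-in for routing the request through the proxy.

```java
import java.util.*;

// Sketch: warm the proxy cache from a provided URL list before the
// connection is in use, so first visitors already get cache hits.
public class CacheWarmer {
    private final Map<String, String> cache = new HashMap<>();

    void warm(List<String> urls) {
        for (String url : urls) {
            cache.computeIfAbsent(url, this::fetch); // fetch each URL once
        }
    }

    boolean isCached(String url) { return cache.containsKey(url); }

    // Hypothetical stand-in for a request routed through the proxy.
    private String fetch(String url) { return "<cached copy of " + url + ">"; }
}
```

Combined with a TTL on the cached entries, this gives the fast start-up described above while still letting dynamic pages be re-cached on later hits.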

