Recommend a web crawler?

Hello, I've had a quick search to try and find a web crawler that will fetch all the data on a page, so my caching proxy can store it for users when they use the system later on.

I have seen a few theories and programs out there, but which one will be the best? Or should I just write my own in Java? All I want to do is grab URLs from a provided list and have the crawler scan through the sites, which should then be auto-cached since the server sits in between. Any recommendations of ones people have used before? I don't want to be testing programs for hours, so I thought I would get a few reviews here :p

Currently looking at: https://code.google.com/p/crawler4j/
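For reference, a crawler4j crawl is not much code. Below is a minimal sketch, assuming crawler4j 4.x; the storage folder, proxy host/port, and seed URL are placeholders for whatever the real setup uses:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class CacheWarmer extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Only follow links that stay on the seed site (placeholder domain).
        return url.getURL().toLowerCase().startsWith("http://domainx.tld/");
    }

    @Override
    public void visit(Page page) {
        // Nothing to store here -- the point is that the fetch itself
        // passed through the caching proxy and populated its cache.
        System.out.println("Warmed: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");  // hypothetical path
        config.setMaxDepthOfCrawling(2);                 // keep the crawl shallow
        config.setPolitenessDelay(200);                  // ms between requests

        // Route every fetch through the caching proxy (placeholder host/port).
        config.setProxyHost("proxy.example.local");
        config.setProxyPort(3128);

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        // Seeds would come from the provided URL list, one addSeed per entry.
        controller.addSeed("http://domainx.tld/");

        controller.start(CacheWarmer.class, 4);  // 4 crawler threads
    }
}
```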


To be honest, caching content on a proxy is pretty old-school stuff; most sites are dynamically generated now. So while you might be able to cache some images and the like, most content is dynamic, and caching it on your proxy is not going to gain you much.

Also, what are you going to do, cache the whole internet? Why would you cache something that nobody has requested yet? How do you know anyone will ever go there, ever!! Now you have just used up bandwidth for nothing.

If you want to cache locally for user number 2 going to domainx.tld, that is great, but I wouldn't want to grab domainx.tld until someone has actually requested it. Even then, since domainx.tld will most likely change in 10 minutes, or even on the next hit because it's dynamic, caching is not going to get you much.
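You can see this for yourself by checking what a page's response headers allow: dynamic pages usually answer with no-cache, no-store, or max-age=0, which tells the proxy not to reuse them. A rough sketch using Java 11's built-in HttpClient (pass the URL on the command line):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CacheHeaderCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // HEAD request: we only care about the headers, not the body.
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0]))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response =
                client.send(request, HttpResponse.BodyHandlers.discarding());

        // "no-cache", "no-store", or "max-age=0" here means the proxy
        // cannot usefully hold on to this response.
        System.out.println("Cache-Control: "
                + response.headers().firstValue("Cache-Control").orElse("(none)"));
    }
}
```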

This specific product needs caching, and trust me, caching does help even on dynamic pages. The user group is very specific and the pages to be accessed will also be specific: sites are blocked, media is blocked, and while dynamic content like news websites might change, it does not change that often. Anywhere I can shave off a few seconds is worth doing. The pages are requested before use (as in, before anyone uses the connection) just to give that few minutes saved at the start, even if they are re-cached later because they are dynamic.

I have done so much research into the viability of this, and it is 100% needed :)
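For the pre-warming pass itself, a full crawler may even be overkill if the list is flat. A minimal sketch that just pulls each listed URL through the proxy once, again assuming Java 11's HttpClient, with hypothetical names for the list file and proxy:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class PreWarm {
    public static void main(String[] args) throws Exception {
        // Every request goes through the caching proxy (placeholder host/port),
        // so each fetch lands in the cache before any user connects.
        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress("proxy.example.local", 3128)))
                .build();

        // "urls.txt" is the provided list, one URL per line (hypothetical name).
        for (String line : Files.readAllLines(Path.of("urls.txt"))) {
            if (line.isBlank()) continue;
            HttpRequest request = HttpRequest.newBuilder(URI.create(line.trim())).build();
            HttpResponse<Void> response =
                    client.send(request, HttpResponse.BodyHandlers.discarding());
            System.out.println(response.statusCode() + "  " + line.trim());
        }
    }
}
```

Note this only touches the listed pages themselves; pulling in each page's linked images and scripts is where a real crawler like crawler4j would earn its keep.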

Or just upgrade the available bandwidth ;)

Ha, wish that was possible ;) What's your contract rate, mr budman? lol

I can send you an invoice if you want ;) I'll give you the Neowin discount ;) heheh
