Recommend a web crawler?


5 replies to this topic

#1 Original Poster

Original Poster

    Systems Developer

  • Tech Issues Solved: 1
  • Joined: 15-July 08
  • Location: my room
  • OS: windows 7/8, Kali, ubuntu, OSx 10.9
  • Phone: Android

Posted 12 August 2014 - 12:54

Hello, I have had a quick search for a web crawler that will fetch all the data on a page, so that my caching proxy can store it for users when they use the system later on.

 

I have seen a few approaches and programs out there, but which one is best? Or should I just write my own in Java? All I want is to take URLs from a provided list and have the crawler scan through those sites; the pages should then be cached automatically, since the proxy server sits in between. Any recommendations for crawlers people have used before? I don't want to spend hours testing programs, so I thought I would get a few reviews here :p

 

currently looking at: https://code.google.com/p/crawler4j/
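
For reference, the "grab URLs from a list and let the proxy cache them" idea can be sketched in plain Java without a crawler library. This is only a rough sketch under assumptions: it needs Java 11+ for java.net.http, the class name CacheWarmer, the file name and the proxy host/port are made up for illustration, and it requests only the listed pages themselves. It does not follow links or pull in images/CSS the way a real crawler such as crawler4j would. Also, HTTPS requests are normally tunnelled through a proxy, so the proxy can usually only cache plain HTTP unless it intercepts TLS.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: requests each listed URL through the caching proxy
// so the proxy has a chance to store the response before users arrive.
class CacheWarmer {

    // Turn raw lines into URIs, skipping blanks, "#" comments,
    // malformed entries and anything that is not http(s).
    static List<URI> parseUrlList(List<String> lines) {
        List<URI> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            try {
                URI uri = URI.create(trimmed);
                String scheme = uri.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    urls.add(uri);
                }
            } catch (IllegalArgumentException e) {
                // Malformed entry: skip it rather than abort the whole run.
            }
        }
        return urls;
    }

    // Send one GET per URL via the proxy and discard the body;
    // whether anything is stored is entirely up to the proxy.
    static void warm(List<URI> urls, String proxyHost, int proxyPort) {
        HttpClient client = HttpClient.newBuilder()
                .proxy(ProxySelector.of(new InetSocketAddress(proxyHost, proxyPort)))
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        for (URI url : urls) {
            HttpRequest request = HttpRequest.newBuilder(url)
                    .timeout(Duration.ofSeconds(30))
                    .GET()
                    .build();
            try {
                HttpResponse<Void> response =
                        client.send(request, HttpResponse.BodyHandlers.discarding());
                System.out.println(response.statusCode() + " " + url);
            } catch (IOException e) {
                System.out.println("FAILED " + url + " (" + e.getMessage() + ")");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: CacheWarmer <url-file> <proxy-host> <proxy-port>");
            return;
        }
        List<URI> urls = parseUrlList(Files.readAllLines(Path.of(args[0])));
        warm(urls, args[1], Integer.parseInt(args[2]));
    }
}
```

Run as something like `java CacheWarmer.java urls.txt proxy.local 3128` (file name, host and port are placeholders) from a scheduled job before the users come online.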




#2 +BudMan

BudMan

    Neowinian Senior

  • Tech Issues Solved: 106
  • Joined: 04-July 02
  • Location: Schaumburg, IL
  • OS: Win7, Vista, 2k3, 2k8, XP, Linux, FreeBSD, OSX, etc. etc.

Posted 12 August 2014 - 13:58

To be honest, caching content on a proxy is pretty old-school stuff; most sites are dynamically generated now. So while you might be able to cache some images, etc., most content is dynamic, and caching it on your proxy is not going to provide much benefit.

Also, what are you going to do, cache the whole internet? Why would you cache something that nobody has requested yet? How do you know anyone will ever go there, ever!! So now you have just used up bandwidth for nothing.

If you want to cache locally for user number 2 going to domainx.tld, that is great, but I wouldn't want to grab domainx.tld until someone has actually requested it. Even then, since domainx.tld will most likely change in 10 minutes, or even on the next hit because it's dynamic, caching is not going to get you much.
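
Whether a given page is worth caching at all can be checked from the Cache-Control header the site sends back. The sketch below is a simplified reading of the HTTP caching rules, not a full implementation: the class and method names are made up, it ignores Expires, Vary, heuristic freshness and anything the proxy is configured to override, and it treats no-cache as "not reusable" even though such responses can be stored and revalidated.

```java
import java.util.Locale;

// Hypothetical helper: estimates how long a shared cache (such as a proxy)
// may reuse a response, based only on its Cache-Control header.
class CacheControlCheck {

    // Returns the freshness lifetime in seconds, or 0 if the response
    // should not be reused by a shared cache without going to the origin.
    static long sharedFreshnessSeconds(String cacheControl) {
        if (cacheControl == null) {
            return 0;
        }
        long maxAge = -1;
        long sMaxAge = -1;
        for (String part : cacheControl.split(",")) {
            String directive = part.trim().toLowerCase(Locale.ROOT);
            if (directive.equals("no-store")
                    || directive.startsWith("no-cache")
                    || directive.startsWith("private")) {
                return 0;
            }
            if (directive.startsWith("s-maxage=")) {
                sMaxAge = parseSeconds(directive.substring("s-maxage=".length()));
            } else if (directive.startsWith("max-age=")) {
                maxAge = parseSeconds(directive.substring("max-age=".length()));
            }
        }
        // s-maxage applies to shared caches and takes priority over max-age.
        if (sMaxAge >= 0) {
            return sMaxAge;
        }
        return Math.max(maxAge, 0);
    }

    // Parses a delta-seconds value; returns -1 when it is not a number.
    static long parseSeconds(String value) {
        try {
            return Long.parseLong(value.replace("\"", "").trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

A page answering with something like "public, max-age=600" would be reusable for ten minutes, while "private" or "no-store" responses would give 0 and gain nothing from being fetched in advance.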

#3 OP Original Poster

Posted 12 August 2014 - 14:16

This specific product needs caching, and trust me, caching does help even on dynamic pages. The user group is very specific and the pages to be accessed will also be specific; sites are blocked and media is blocked. Dynamic content like news websites might change, but not that often, and anything that shaves a few seconds off is worth doing. The pages are requested before use (as in, before anyone uses the connection) just to save those few minutes at the start, even if they are re-cached later because they are dynamic.

I have done so much research into the viability of this, and it is 100% needed :)



#4 +BudMan

Posted 12 August 2014 - 14:21

Or just upgrade the available bandwidth ;)

#5 OP Original Poster

Posted 12 August 2014 - 14:25

Ha, wish that was possible ;) What's your contract rate, Mr BudMan? lol



#6 +BudMan

Posted 12 August 2014 - 14:28

I can send you an invoice if you want ;) Will give you the Neowin discount ;) heheh