Microsoft caught using Google search results...


from: WebProNews

A developer asked me today about a particular IP address he was watching scan his site. The IP was 65.54.188.86, which is registered to Microsoft Corp., One Microsoft Way, Redmond, Washington 98052. The visitor was not sending the header information a crawler normally sends to the web server, such as an HTTP robot name or other identifying info, or even a browser name.

Its behavior made it look like a crawler, especially since it was spidering URLs that no longer exist (search engine spiders crawl site segments at regular intervals and often come back when an initial crawl left URLs uncrawled), and it was doing so at a rate of one page every 3 to 5 seconds. The visit started at 7:37 am and the visitor was still on the site at 12:00 pm.

Correction: the data was there after all. Here's the crawler info: msnbot/0.3 (+http://search.msn.com/msnbot.htm)
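For anyone who wants to check their own logs for the same pattern, here is a minimal sketch of how the numbers above (requests per IP, seconds between requests, user agent, count of dead URLs) could be pulled from a web server access log. It assumes the log is in the common "combined" format; the function name `summarize_visitors`, the sample date, and the sample paths are made up for illustration and are not taken from the developer's actual logs. Only the IP address and the user agent string come from the article.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed "combined" log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize_visitors(lines):
    """Group log lines by client IP and report request pacing and user agents."""
    visits = defaultdict(lambda: {"times": [], "agents": set(), "not_found": 0})
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        entry = visits[match["ip"]]
        entry["times"].append(
            datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
        )
        entry["agents"].add(match["agent"])
        if match["status"] == "404":
            entry["not_found"] += 1  # request for a URL that no longer exists

    summary = {}
    for ip, entry in visits.items():
        times = sorted(entry["times"])
        # Seconds between each pair of consecutive requests from this IP.
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        summary[ip] = {
            "requests": len(times),
            "mean_gap_seconds": sum(gaps) / len(gaps) if gaps else None,
            "agents": sorted(entry["agents"]),
            "not_found": entry["not_found"],
        }
    return summary


# Hypothetical sample lines: the date and paths are invented for illustration.
SAMPLE_LOG = [
    '65.54.188.86 - - [01/Jan/2005:07:37:00 -0800] "GET /old/a.html HTTP/1.0" '
    '404 512 "-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
    '65.54.188.86 - - [01/Jan/2005:07:37:04 -0800] "GET /old/b.html HTTP/1.0" '
    '404 512 "-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
    '65.54.188.86 - - [01/Jan/2005:07:37:08 -0800] "GET /old/c.html HTTP/1.0" '
    '404 512 "-" "msnbot/0.3 (+http://search.msn.com/msnbot.htm)"',
]

if __name__ == "__main__":
    for ip, stats in summarize_visitors(SAMPLE_LOG).items():
        print(ip, stats)
```

A steady gap of a few seconds, a high share of 404s, and a bot-style user agent together are what made this visitor look like a spider rather than a person. None of that proves where the spider got its list of URLs.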

Here's the kicker

So now you're saying, so what, big deal. But this really is a big deal. It's a big deal not only because the URLs this visitor was requesting no longer exist, but because the only place these URLs can be found is in Google's search results for site:www.sitename.com. A similar query on MSN Search doesn't show the URLs at all, even on the beta version of Microsoft's new search engine. Yet within just hours of the visitor's exit from the site, the same search on Microsoft's new search engine showed all of the URLs in question fully indexed in its results.

My Theory On This Mysterious Microsoft Crawler

The old MSN required a fee to be crawled by its spider. But a few months back MSN dropped the fee and said it was going to begin crawling the entire web, and doing it without charge. However, that's no easy task. So I believe MSN is using the results from Google, and possibly even Yahoo, to get all of the pages those engines have indexed for sites that have a relatively low page count in the current MSN search engine.

First off, that's the fastest way to get the relevant pages from a web site. Sure, they could just go to the site directly and start crawling, but in doing so they're going to get tons of duplicate URLs and URLs that seem different but point to the same content. Crawling Google's results will reduce the bandwidth needed to some extent, but will not completely take care of the duplicate content issue their spider will encounter.

Secondly, crawling Google's results can act as a qualitative measure for their new search engine. By creating a baseline number of pages per site when the new Microsoft Search is launched and running a comparison at regular intervals for the next 6 months, they'll be able to determine internally whether their engine is finding and indexing the same links, and as many links, as Google. Call it competitive analysis or whatever you want.

So Microsoft's Screen Scraping?

Obviously my conclusion should be taken with a grain of salt, but it's a definite possibility. Microsoft very well could be screen scraping Google (or maybe even using their API, LOL) and crawling the URLs it finds. It makes sense as a business case, but I wonder if there are any legal issues there. I doubt it. It's like putting garbage out on the curb: once it's out there, it's fair game. But I bet Google's lawyers would have more to say than that on the case.

http://www.webpronews.com/insiderreports/s...archEngine.html

this doesn't mean crap. for all we know it was a research project from microsoft research doing something else. just because an IP is owned by MS doesn't mean it's being used by the company MS, it could be an at-home employee or any number of other things... including VPN users that have access to MS's IP range

i guess microsoft uses a few text filters (e.g. -microsoft, and the search won't bring up anything with microsoft in it...) so i guess they only put in a "microsoft.com: keywords" so you can only browse through their site... cheap cheap microsoft... richest person in the world can't afford to get developers to create a search engine?
