
CNN blocking RSS access for lots of sites


Question

Examples:

http://www.redrivernet.com/links/

http://www.wabi.tv/content/4007/National_News/

http://www.freewebportal.net/

You can find more examples by searching for a couple of the stale headlines together in one query.

Notice how the first article is from September 2 (over a month ago)?

When my web server (example 1) attempts to retrieve the feed from http://rss.cnn.com/rss/cnn_topstories.rss, the request is instead redirected to http://feedproxy.feedburner.com/rss/cnn_topstories.rss, which serves that older news.

They are purposely blocking me and many others. I guess they don't want people to use them as a source of news?

Does anyone know how best to work around this?


14 answers to this question

  • 0

You are right. It does. And if you click on the second one, you can see the difference. My guess is that only certain IPs are blocked.

My computer's IP can get to the first link, my web server is always redirected to the second link.

So CNN is blocking access from those three sites and many others to their current updated RSS feed info.


  • 0

I would guess you were hitting the feed too often, and an automated system stepped in and redirected you to an older cache as a way of reducing load on the main feeds.


  • 0

Their TTL is only 5 minutes. I doubt that was the issue, but nevertheless, we are blocked.
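For anyone who wants to check the advertised TTL rather than take it on faith: in RSS 2.0 the optional `<ttl>` element sits inside `<channel>` and is given in minutes. Below is a minimal Python sketch that reads it. The function name `feed_ttl_minutes`, the 60-minute fallback, and the sample XML are assumptions for illustration, not anything taken from CNN's actual feed.

```python
import xml.etree.ElementTree as ET


def feed_ttl_minutes(rss_xml, default=60):
    """Return the channel's <ttl> value in minutes, or `default` if it is missing or invalid."""
    root = ET.fromstring(rss_xml)
    text = root.findtext("channel/ttl")  # None when the element is absent
    try:
        return int(text.strip())
    except (AttributeError, ValueError):
        return default


# Hypothetical sample feed, used only to exercise the function.
sample = "<rss version='2.0'><channel><title>Example</title><ttl>5</ttl></channel></rss>"
print(feed_ttl_minutes(sample))  # prints 5
```

A fetch script could treat this value as the minimum interval between requests.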

This automated system is also blocking hundreds of other publicly accessible sites.

More examples:

http://www.losangelesdailynews.net/

http://newtownhighschool.org/index.php

http://www.scary-software.com/RSS/

http://www.4seasonswireless.com/news.php

News is over a month old?!

http://www.google.com/search?q=%22Bush+to+...nvestigators%22


  • 0

Could it be that they're blocking specific clients? I.e., browser-based readers are OK and identified as real people consuming the news, rather than a script scraping the news to be displayed on another site.

If you can, it might be worth editing whatever script you use on your site so that it identifies itself as a browser-based reader, and seeing if that works.


  • 0
I would guess you were hitting the feed too often, and an automated system stepped in and redirected you to an older cache as a way of reducing load on the main feeds.

Bingo. You might want to cache the stories for at least an hour before hitting their server again. Lots of places limit how often an IP address can hit their feeds. In all honesty, it's a good practice.
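A rough sketch of that kind of caching, in Python for illustration (the original poster's site runs a PHP snippet, so this shows the idea rather than a drop-in fix). The helper name `get_cached`, the cache file path, and the one-hour default are all assumptions. It reuses a local copy of the feed until it is older than `max_age` seconds and only then calls the supplied `fetch` function.

```python
import os
import time


def get_cached(path, fetch, max_age=3600):
    """Return the cached bytes at `path` if they are fresh; otherwise call fetch() and store the result.

    `fetch` is any zero-argument callable returning the feed as bytes.
    """
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age:
        # Cache file is younger than max_age seconds: do not hit the remote server.
        with open(path, "rb") as f:
            return f.read()
    data = fetch()
    with open(path, "wb") as f:
        f.write(data)
    return data
```

With this in place, page views within the same hour are served from the local file, so the feed server sees at most one request per hour from the site.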


  • 0

That's all well and good, but I set up caching a few weeks ago now, and they're still blocking me.

And the blockage is based on my server's IP address. When I try to go there in lynx on that server, it also gets the old data.


  • 0

That is true. Feedproxy is out of date.

And my server is getting redirected to the out-of-date feedproxy data.

How do I make them either get updated feedproxy or get around their redirection?


  • 0

What IP / ISP are you coming from? I went straight to http://rss.cnn.com/rss/cnn_topstories.rss and got today's stories with no problem.

How are you fetching the feed: via a browser or via code?

If it's code, try looking at PHP and cURL. Using those, you can imitate a real browser visit, so hopefully the feed will give you the proper version.

If you need a hand with the cURL stuff, shout!


  • 0
That's all well and good, but I set up caching a few weeks ago now, and they're still blocking me.

And the blockage is based on my server's IP address. When I try to go there in lynx on that server, it also gets the old data.

It could still be client-based redirection.

I'd add a user-agent header to the request to mimic a real browser and give it a try.


  • 0

I just did a little test. My first request without a browser user agent string was redirected to http://feedproxy.feedburner.com/rss/cnn_topstories.rss. My second request with the UA of "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.30 Safari/525.13" was not redirected and retrieved http://rss.cnn.com/rss/cnn_topstories.rss.

It's client-based redirection. Add a user agent string to the request headers and you'll be fine. Also don't test things in Lynx and expect accurate results ;)
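For reference, a test like that can be scripted along these lines. This is a Python sketch using only the standard library, not the exact code behind the test above, and the helper names `build_request` and `fetch` are made up for the example. It sends a browser-style User-Agent header and reports the final URL, so a redirect to feedproxy shows up as a mismatch with the requested URL.

```python
import urllib.request

FEED_URL = "http://rss.cnn.com/rss/cnn_topstories.rss"

# The user agent string quoted in the test above; any browser-like string should serve the same purpose.
BROWSER_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 "
    "(KHTML, like Gecko) Chrome/0.2.149.30 Safari/525.13"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=10):
    """Fetch `url` and return (final_url, body); final_url differs from url if a redirect was followed."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.geturl(), resp.read()
```

Calling `fetch(FEED_URL)` and comparing the returned final URL with `FEED_URL` shows whether the request was redirected. Whether the server still behaves this way for a given client is something to verify rather than assume.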


  • 0

I'm using the NewsParserX snippet with modxcms.

I'm not familiar enough to know how to change the headers that the script uses when it pulls the data...

Although I do see that they updated their feedproxy data to have yesterday's "news", so that's some progress.

