• 0

C#: clean HTML trash from feed


Question

Hello gang,

 

I am making a Windows Phone app that reads the NYC transit info. The MTA has a data feed, http://mta.info/status/serviceStatus.txt, but the <text> tag is filled with various HTML tags and symbols. Sadly, wrapping this up in <html></html> tags does not allow it to render, so I was going to simply clear it out with string replace (ugly, I know). I thought I'd look for other ideas here first.

 

James


5 answers to this question


  • 0
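// Requires: using System.Xml; using System.Text.RegularExpressions;
XmlDocument doc = new XmlDocument();
doc.Load("http://mta.info/status/serviceStatus.txt");

// InnerText returns the <text> content with the XML escapes already decoded into real HTML.
var texts = doc.GetElementsByTagName("text");
string innerText = texts[0].InnerText;

// Strip anything that looks like a tag, i.e. everything from "<" to the next ">".
var t = Regex.Replace(innerText, @"<(.|\n)*?>", "");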

It does render fine in a WebBrowser control if you use control.NavigateToString(texts[0].InnerText).

The Regex is an evil, filthy hack, but it works as long as none of the actual text is wrapped in "<" and ">".


  • 0

Really?! Very interesting. Thanks. I will check it out.

The next question is how did you figure this out?  Meaning, how did you know that it would work without some of these characters?


  • 0

HTML is derived from SGML, like XML. The contents of the <text> blocks have to be escaped to be valid XML, but the XmlDocument class reads the & escape sequences as the characters they represent (e.g. "<" instead of "&lt;").

Unless they toss in some text inside of <> brackets the text should parse through Regex just fine. You'd just be missing whatever is in the brackets.
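
To make both points concrete, here is a minimal sketch. The sample XML is made up to resemble one <text> element and is not real feed data; the EntityDemo class and its method names are likewise just for illustration.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

static class EntityDemo
{
    // Hypothetical sample shaped like one escaped <text> element; not real feed data.
    const string Sample =
        "<line><text>&lt;b&gt;Delays&lt;/b&gt; on the &lt;i&gt;A&lt;/i&gt; line</text></line>";

    // Parses the sample and returns the <text> content as the XML parser sees it.
    public static string DecodedHtml()
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        // The parser has already turned &lt; and &gt; back into < and >.
        return doc.GetElementsByTagName("text")[0].InnerText;
    }

    // Same pattern as above: removes everything from "<" to the next ">".
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<(.|\n)*?>", "");
    }

    static void Main()
    {
        string html = DecodedHtml();
        Console.WriteLine(html);            // <b>Delays</b> on the <i>A</i> line
        Console.WriteLine(StripTags(html)); // Delays on the A line

        // The caveat: real text inside angle brackets is removed as well.
        Console.WriteLine(StripTags("Use exits <north side> only"));
    }
}
```

The last call shows the failure mode: "<north side>" looks like a tag to the pattern, so it disappears from the output along with the genuine markup.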

 

As for the WebBrowser control, I already knew the NavigateToString() method didn't require a full document. :)

 

You could probably parse the tags themselves and convert the HTML to RTF and shove it in a RichTextBox control instead, too.


  • 0

Sad to admit that I did not know HTML was a subset of SGML (after ALL of these years) 

 

I can't use the XmlDocument class in a WP8 app, but I can work that out, likely with the XDocument class. Again, I'm more impressed that you knew that this data would work in this fashion.
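
For reference, a rough sketch of what the XDocument version might look like. It assumes the feed has already been downloaded into a string, and that WebUtility.HtmlDecode is available on the target platform to turn leftover symbols such as "&amp;" into plain characters. The FeedCleaner class, its method names, and the sample XML are all made up for illustration; the real feed's layout may differ.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class FeedCleaner
{
    // Returns cleaned plain text for every <text> element in an already-downloaded feed string.
    public static string[] PlainTexts(string feedXml)
    {
        return XDocument.Parse(feedXml)
            .Descendants("text")
            .Select(e => Clean(e.Value)) // Value is already XML-decoded, like InnerText
            .ToArray();
    }

    static string Clean(string html)
    {
        // Replace each tag with a space so words on either side do not run together.
        string noTags = Regex.Replace(html, @"<[^>]*>", " ");
        // Decode HTML entities that survived as literal text, e.g. "&amp;" becomes "&".
        string decoded = WebUtility.HtmlDecode(noTags);
        // Collapse runs of whitespace left behind by the removed tags.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }

    static void Main()
    {
        // Hypothetical sample; not real feed data.
        string sample =
            "<service><line><name>A</name>" +
            "<text>&lt;p&gt;Delays &amp;amp; detours near &lt;b&gt;Canal St&lt;/b&gt;&lt;/p&gt;</text>" +
            "</line></service>";

        foreach (string s in PlainTexts(sample))
            Console.WriteLine(s); // Delays & detours near Canal St
    }
}
```

Swapping tags for a space instead of an empty string is a judgment call: it keeps "</p><p>" boundaries from gluing words together, at the cost of splitting a word that has a tag in the middle of it.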

 

Have a great evening

