c# clean HTML trash from feed

c# html

5 replies to this topic

#1 James Rose

James Rose

    Software Developer

  • Tech Issues Solved: 1
  • Joined: 20-January 04
  • Location: New York City

Posted 24 August 2013 - 17:53

Hello gang,

 

I am making a Windows Phone app that reads the NYC transit info. The MTA has a data feed: http://mta.info/stat...rviceStatus.txt but the <text> tag is filled with various HTML tags and symbols. Sadly, wrapping it in <html></html> tags does not make it render, so I was planning to simply clear it out with string replacement (ugly, I know). I thought I'd look for better ideas here.

 

James




#2 Eric

Eric

    Neowinian Senior

  • Tech Issues Solved: 11
  • Joined: 02-August 06
  • Location: Greenville, SC

Posted 24 August 2013 - 22:28

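using System.Text.RegularExpressions;
using System.Xml;

XmlDocument doc = new XmlDocument();
doc.Load("http://mta.info/status/serviceStatus.txt");

// InnerText of the first <text> element is the unescaped HTML fragment.
var texts = doc.GetElementsByTagName("text");
string innerText = texts[0].InnerText;

// Remove anything that looks like a tag: "<" up to the nearest ">".
var t = Regex.Replace(innerText, @"<(.|\n)*?>", "");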

It does render fine in a WebBrowser control if you use control.NavigateToString(texts[0].InnerText).

The regex is an evil, filthy hack, but it works as long as none of the actual text is wrapped in "<" and ">".



#3 OP James Rose

Posted 24 August 2013 - 22:40

Really?!   Very interesting.  Thanks.  I will check it out.

The next question is how did you figure this out?  Meaning, how did you know that it would work without some of these characters?



#4 Eric

Posted 24 August 2013 - 22:56

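HTML (through HTML 4) is an application of SGML, and XML is a subset of SGML, so the two share the same basic markup syntax. The <text> blocks have to be escaped to be valid XML, but the XmlDocument class decodes the character entities back into the characters they represent (e.g. "<" instead of "&lt;"), so InnerText hands you the original HTML.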

Unless they toss in some text inside of <> brackets the text should parse through Regex just fine. You'd just be missing whatever is in the brackets.
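
To make the two steps above concrete, here is a minimal sketch. The sample XML is made up for illustration and only stands in for one <text> element of the feed; the class and method names are hypothetical too. It shows that InnerText returns the HTML with the entities already decoded, and that the regex then removes everything between a "<" and the nearest ">", including real text if it happens to sit between those characters.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

static class EntityDecodingDemo
{
    // Hypothetical sample: escaped HTML inside a <text> element, as in the feed.
    public const string SampleXml =
        "<service><text>&lt;b&gt;Delays&lt;/b&gt; on the &lt;br/&gt;A line</text></service>";

    // XmlDocument decodes &lt; and &gt;, so this returns the raw HTML fragment.
    public static string RawInnerText()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc.GetElementsByTagName("text")[0].InnerText;
    }

    // Same pattern as in the post above: "<", any characters, nearest ">".
    public static string StripTags(string html)
    {
        return Regex.Replace(html, @"<(.|\n)*?>", "");
    }
}
```

With this sample, RawInnerText() gives "<b>Delays</b> on the <br/>A line" and StripTags turns that into "Delays on the A line". The caveat shows up with input such as "x < 5 and y > 3": the pattern treats "< 5 and y >" as a tag and drops it.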

 

As for the WebBrowser control, I already knew the NavigateToString() method didn't require a full document. :)

 

You could probably parse the tags themselves and convert the HTML to RTF and shove it in a RichTextBox control instead, too.



#5 OP James Rose

Posted 24 August 2013 - 23:02

Sad to admit that I did not know HTML was based on SGML (after ALL of these years).

 

I can't use the XmlDocument class in a WP8 app, but I can work that out, likely with the XDocument class. Again, I'm more impressed that you knew that this data would work in this fashion.

 

Have a great evening



#6 Eric

Posted 24 August 2013 - 23:23

XDocument should work just fine. I was going to suggest it but I wasn't sure if it was in Silverlight.
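
For anyone following along, a rough sketch of the same idea with XDocument might look like the code below. It assumes the feed has already been downloaded into a string (the download itself is left out), that the <text> elements are not in an XML namespace, and it reuses the regex from earlier in the thread. The class and method names are made up for illustration.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class FeedText
{
    // Returns the first <text> element's content with anything tag-like removed.
    public static string FirstTextWithoutTags(string feedXml)
    {
        XDocument doc = XDocument.Parse(feedXml);

        // Assumption: <text> has no namespace, so a plain name matches it.
        XElement text = doc.Descendants("text").FirstOrDefault();
        if (text == null)
        {
            return string.Empty;
        }

        // XElement.Value, like InnerText, returns the entity-decoded content.
        return Regex.Replace(text.Value, @"<(.|\n)*?>", "");
    }
}
```

Usage would be something like FeedText.FirstTextWithoutTags(downloadedXml), where downloadedXml is whatever string the app fetched from the feed.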




