• 0

C# regex: remove white space, but not between <pre> and </pre>


Question

Hi all

 

I am trying to write a regex to remove all the extra white space from HTML,

but the regex I am currently using doesn't account for the fact that the content of the pre tags may itself contain opening "<" and closing ">" characters.

 

This matches a whole pre block, including the tags themselves:

 

(<)\s*?(pre\b[^>]*?)(>)([\s\S]*?)(<)\s*(/\s*?pre\s*?)(>)

 

e.g.

(009)156 (010) (0)<pre> <test> edehofo<w<dieoj >  ></pre>     yuui u    ji 

will match 

<pre> <test> edehofo<w<dieoj >  ></pre>

 

and 

(?<=\s)\s+(?![^<>]*</pre>)

 

almost works, but it breaks if there is a "<" or ">" in the markup inside the pre tags. For example,

 

space[     ]spaces <pre>[          ]spaces</pre>space[      ]spaces 

 

will result in 

space[ ]spaces <pre>[ ]spaces</pre>space[ ]spaces 

 

but if there is a "<" or ">" in the pre tags then it will not work.
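
To make the failure concrete, here is a minimal sketch of the behaviour described above. The class and method names are made up for illustration; only the pattern comes from the post. The lookahead (?![^<>]*</pre>) only protects white space when nothing but non-bracket characters sits between it and </pre>, so any tag inside the pre block switches the protection off.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Pattern copied from the question: drop white space that follows white space,
    // unless only non-bracket characters separate it from a closing </pre>.
    private static readonly Regex Whitespace =
        new Regex(@"(?<=\s)\s+(?![^<>]*</pre>)");

    public static string Collapse(string html) =>
        Whitespace.Replace(html, string.Empty);

    public static void Main()
    {
        // No brackets inside <pre>: the four spaces between x and y survive.
        Console.WriteLine(Collapse("a   b <pre>x    y</pre> c   d"));
        // prints: a b <pre>x    y</pre> c d

        // A tag inside <pre>: the lookahead stops at "<i>", so the spaces are collapsed.
        Console.WriteLine(Collapse("a   b <pre>x    <i>y</i></pre> c   d"));
        // prints: a b <pre>x <i>y</i></pre> c d
    }
}
```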

 

Could anyone help me?


  • 0

A warning: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

 

A solution: http://stackoverflow.com/questions/8762993/remove-white-space-from-entire-html-but-inside-pre-with-regular-expressions (same as your regex)

 

Just be careful with regex and HTML. Parsing the HTML into a C# object and then outputting it without whitespace is probably a better way to do this (though it is likely to be slower).

You shouldn't include a raw < or > as text inside <pre>; they should be escaped as &lt; and &gt;.

  • 0

Thanks, but I've already been to those sites and a lot of others trying to find a solution.

From that, it looks like I may be able to do it in two steps:

1. Encode all HTML within the pre tags.

2. Then use (?<=\s)\s+(?![^<>]*</pre>) to remove the white space.

I am now finding it hard to find a regex that will encode the HTML within a pre tag. For example:

I love cats and dogs but dogs > than cats

<spaces>       <spaces>
<pre>
    1 > 0  = ?

    <code>       dedw</code>
</pre>

should become

I love cats and dogs but dogs > than cats <spaces> <spaces>
<pre>
    1 &gt; 0  = ?

    &lt;code&gt;       dedw&lt;/code&gt;
</pre>
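
The two steps listed above could be sketched roughly as follows. This is only an illustration under some assumptions: the names PreEncoder, EncodePreContents and Minify are hypothetical, WebUtility.HtmlEncode (from System.Net) is used for the escaping, and a MatchEvaluator lambda rewrites just the text between the pre tags. Step 2 reuses the white space pattern from the post unchanged.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PreEncoder
{
    // Group 1: opening tag, group 2: contents, group 3: closing tag.
    // The lazy .*? keeps each <pre>...</pre> pair as its own match.
    private static readonly Regex PreBlock = new Regex(
        @"(<pre\b[^>]*>)(.*?)(</pre\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // White space pattern from the post (step 2).
    private static readonly Regex Whitespace =
        new Regex(@"(?<=\s)\s+(?![^<>]*</pre>)");

    // Step 1: HTML-encode only what sits between the pre tags.
    public static string EncodePreContents(string html) =>
        PreBlock.Replace(html, m =>
            m.Groups[1].Value
            + WebUtility.HtmlEncode(m.Groups[2].Value)
            + m.Groups[3].Value);

    // Step 2: with no raw < or > left inside <pre>, the lookahead protects the block.
    public static string Minify(string html) =>
        Whitespace.Replace(EncodePreContents(html), string.Empty);

    public static void Main()
    {
        Console.WriteLine(Minify("a   b <pre>1 > 0   <code>x</code></pre> c   d"));
    }
}
```

Two caveats with this approach: text that is already escaped (for example &lt;) gets escaped a second time, and real markup inside the pre block, such as <code>, ends up displayed as literal text.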

It would be easier if I had a regex that simply ignored everything inside a given tag.

I am doing my own research and trying to find the answer, but I thought I would clarify what I am trying to do.

Thanks if you can help me find a solution.

  • 0

Hi Eric

 

I'm trying to do it in C#.

Basically, I can get what I want if I can strip out white space everywhere except between two sets of characters,

e.g.  <pre> anything </pre> or <code> anything </code>

It seems more challenging than I thought it would be.

 (?<=\s)\s+(?![^<>]*</pre>)  is awesome, but I would like to get it to ignore everything inside a given tag.

My understanding of how to get the right syntax for regex is limited.

In short: remove white space everywhere except between the tags (pre or code).

Any help with getting this to work would be awesome.
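
One way to "strip white space everywhere except between two sets of characters" is to cut the input into segments first. The sketch below is an assumption-laden illustration, not a tested production minifier: SegmentMinifier and its members are hypothetical names. It relies on the documented behaviour of Regex.Split that text matched by a capturing group is included in the result array, so with exactly one capturing group the protected blocks land at the odd indexes.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SegmentMinifier
{
    // A single capturing group around the whole alternation.
    // Each alternative pairs a tag with its own closing tag, so a <code>
    // nested inside <pre> does not end the pre block early.
    private static readonly Regex Protected = new Regex(
        @"(<pre\b[^>]*>.*?</pre\s*>|<code\b[^>]*>.*?</code\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Drops every white space character that directly follows another one.
    private static readonly Regex ExtraWhitespace = new Regex(@"(?<=\s)\s+");

    public static string Minify(string html)
    {
        string[] parts = Protected.Split(html);
        var sb = new StringBuilder(html.Length);
        for (int i = 0; i < parts.Length; i++)
        {
            // Odd indexes are pre/code blocks (copied verbatim);
            // even indexes are the markup around them (collapsed).
            sb.Append(i % 2 == 1
                ? parts[i]
                : ExtraWhitespace.Replace(parts[i], string.Empty));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Minify(
            "a   b <pre>x    <code>y   z</code>  </pre> c   d <code>p   q</code>   e"));
    }
}
```

Because the white space regex never sees the protected segments, it no longer needs a lookahead at all.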

  • 0

How about the following: http://snipd.net/parsing-xhtml-into-a-dom-tree-in-c

You avoid using regex as you simply parse then print the HTML instead.

 

 

You might also want to re-read my first link, which points out that you cannot parse arbitrary (X)HTML using a regex. It is not possible as regex is not powerful enough to parse (X)HTML. (Parsing a limited subset is possible)

  • 0

To anyone who is good with regex, please help.

(?<=\s)\s+(?![^[<p]+?[e>])

I can't get this to work, but it must be close?

To Lant:

Sorry, but I don't need to parse the HTML. I am not trying to validate it or really read it.

I want to remove white space, but only outside of a given tag such as <pre> or <code>.

It has been working fine with (?<=\s)\s+(?![^<>]*</pre>)  until I hit a new requirement: the <pre> tag can contain < or >,

which this regex does not support. One suggestion was to replace < and > with their HTML entities &lt; and &gt;,

but again I would need something that applies this only to the HTML within the <pre> tag, and I would need to do it in C#.

It seems easier to just get the regex to ignore things between the tags somehow...

Thanks

  • 0

If you don't mind doing the replace in C#, you can use capture groups with this pattern.

(?:\<pre\>)(.*)(?:\<\/pre\>)

It should capture everything between the pre tags, and you can then call string.Replace on the value of the capture. I am working on a Regex.Replace version so you can use RegexOptions.Compiled. This regex was tested with IgnoreCase and Singleline (despite the input being multi-line, Singleline is the option that lets .* match across line breaks).

 


 

 

so for example:

 

private static Regex preBlockMatch = new Regex(@"(?:\<pre\>)(.*)(?:\<\/pre\>)", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.IgnoreCase);

// html is the input markup (variable name assumed); Groups[1] holds the text between the pre tags
string newValue = preBlockMatch.Match(html).Groups[1].Value.Replace("<", "&lt;");

 

Obviously there is room for some enhancement, but it gets you started.

  • 0

Thought I would share what I got working......

 

Also thanks!!! to Squirrelington for the regex for the between stuff

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class MinifiedStream : MemoryStream
{
    private readonly Stream _output;

    public MinifiedStream(Stream stream)
    {
        _output = stream;
    }

    // Drops white space that follows other white space.
    // Note: [^<pre>] is a character class (anything except <, p, r, e, >), not the literal tag;
    // the lookahead matters little here because pre blocks are swapped out before this runs.
    private static readonly Regex Whitespace = new Regex(@"(?<=\s)\s+(?![^<pre>]*</pre>)", RegexOptions.Compiled);

    // Lazy .*? so each pre block is matched on its own, not first <pre> to last </pre>.
    private static readonly Regex PreBlockMatch = new Regex(@"(?:\<pre\>)(.*?)(?:\<\/pre\>)", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Decode only the bytes this call was asked to write.
        // This assumes a pre block never straddles two Write calls.
        var html = Encoding.UTF8.GetString(buffer, offset, count);

        // Swap each pre block for a unique token so the white space regex cannot touch it.
        var result = new Dictionary<string, string>();
        int loopcount = 1;
        foreach (Match match in PreBlockMatch.Matches(html))
        {
            var token = Guid.NewGuid().ToString() + "_" + loopcount;
            html = html.Replace(match.Value, token);
            result.Add(token, match.Value);
            loopcount++;
        }

        html = Whitespace.Replace(html, string.Empty);
        html = html.Trim();

        // Put the original pre blocks back.
        foreach (var match in result)
        {
            html = html.Replace(match.Key, match.Value);
        }

        // The minified bytes start at index 0 of the new array, not at the caller's offset.
        var bytes = Encoding.UTF8.GetBytes(html);
        _output.Write(bytes, 0, bytes.Length);
    }
}

But it still feels as if it should be possible to do this with a single regex.

If anyone can help me improve this, that would be awesome.
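
For what it's worth, a single-pass version might look like the sketch below. It is only an illustration (SinglePassMinifier is a hypothetical name, and it handles pre only). The idea is an alternation: the first branch matches a whole pre block and captures it in group 1, the second branch matches redundant white space. Replacing every match with $1 puts pre blocks back unchanged, and since group 1 is empty when the white space branch matched, that white space is simply dropped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglePassMinifier
{
    // Branch 1: a complete pre block, captured as group 1.
    // Branch 2: white space that follows other white space.
    private static readonly Regex SkipOrCollapse = new Regex(
        @"(<pre\b[^>]*>.*?</pre\s*>)|(?<=\s)\s+",
        RegexOptions.Singleline | RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // "$1" re-emits the pre block; for a white space match group 1 did not
    // participate, so the substitution yields an empty string.
    public static string Minify(string html) =>
        SkipOrCollapse.Replace(html, "$1").Trim();

    public static void Main()
    {
        Console.WriteLine(Minify(
            "  a   b <pre> 1 >  0   <code>  x</code>\n\n</pre>   c   d  "));
    }
}
```

Because the pre block is consumed as one match, the regex engine never looks for white space inside it, so < and > in the block no longer matter.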

  • 0

I agree that there should be a way to get it exclusively in regex but I am only really good at matching things, I don't have a lot of experience in replacement syntax. I know you can do $1 for the first group, $2 for 2nd or ${name} for named groups but how to modify the contents of those groups in the replacement is beyond me atm. Maybe someone with more experience will be able to chime in. :) Glad it is working thus far though. \o/

 

btw the regex I gave you, I probably went a little too crazy with the non-capture groups. I think at the time I was trying some funky magic to try and capture parts of it for replacement purposes but ended up with that.

<pre>(.*)</pre>

would probably be fine as well. lol
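
One thing to watch with <pre>(.*)</pre> when the page has more than one pre block: with Singleline, the greedy .* runs from the first <pre> to the last </pre>, so everything in between is treated as one block. A small sketch of the difference (GreedyVsLazy and CountMatches are made-up names for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Counts how many separate matches a pattern finds in the input.
    public static int CountMatches(string pattern, string html) =>
        Regex.Matches(html, pattern, Options).Count;

    public static void Main()
    {
        const string html = "<pre>one</pre> <p>between</p> <pre>two</pre>";

        // Greedy: one match spanning both blocks and the paragraph between them.
        Console.WriteLine(CountMatches("<pre>(.*)</pre>", html));   // prints 1

        // Lazy: each pre block is its own match.
        Console.WriteLine(CountMatches("<pre>(.*?)</pre>", html));  // prints 2
    }
}
```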

This topic is now closed to further replies.