• 0

Regex c# remove white space but not inside <pre> and </pre>


Question

Hi all

 

I am trying to write a regex that removes all the extra whitespace from HTML, but the regex I am currently using doesn't account for the fact that the text inside the pre tags may itself contain "<" and ">" characters.

 

This matches a whole pre block, including the tags themselves:

 

(<)\s*?(pre\b[^>]*?)(>)([\s\S]*?)(<)\s*(/\s*?pre\s*?)(>)

 

e.g.

(009)156 (010) (0)<pre> <test> edehofo<w<dieoj >  ></pre>     yuui u    ji 

will match 

<pre> <test> edehofo<w<dieoj >  ></pre>

 

and 

(?<=\s)\s+(?![^<>]*</pre>)

 

almost works, but not if there is a "<" or ">" in the markup inside the pre tags. When there is none, for example,

 

space[     ]spaces <pre>[          ]spaces</pre>space[      ]spaces 

 

will result in 

space[ ]spaces <pre>[ ]spaces</pre>space[ ]spaces 

 

but if there is a "<" or ">" in the pre tags then it will not work.
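
To make the failure concrete, here is a minimal C# sketch that runs the second pattern over a string. The class and method names (LookaheadDemo, Collapse) are made up for illustration; only the pattern itself comes from the post.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // The whitespace pattern quoted above: remove every whitespace character
    // that follows another one, unless only non-bracket text separates it from </pre>.
    private static readonly Regex ExtraWhitespace =
        new Regex(@"(?<=\s)\s+(?![^<>]*</pre>)");

    public static string Collapse(string html)
    {
        return ExtraWhitespace.Replace(html, string.Empty);
    }
}
```

Collapse("<pre>a   b</pre>") returns the input unchanged, while Collapse("<pre>a   <b   c</pre>") returns "<pre>a <b   c</pre>". The run in front of the stray "<" is collapsed because [^<>]* in the lookahead cannot get past that character to reach </pre>.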

 

Could anyone help me?

10 answers to this question

  • 0

A warning: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

 

A solution: http://stackoverflow.com/questions/8762993/remove-white-space-from-entire-html-but-inside-pre-with-regular-expressions (same as your regex)

 

Just be careful with regex and HTML. Parsing it into a C# object and then outputting it without whitespace is probably a better way to do this (though likely to be slower).

You shouldn't include < or > in <pre>; they should be escaped as &lt; and &gt;.

  • 0

Thanks, but I've already been to those sites and a lot of others trying to find a solution.

From those, it looks like I may be able to do it in two steps:

1. Encode all HTML within the pre tags.

2. Then use (?<=\s)\s+(?![^<>]*</pre>) to remove the whitespace.

I am now finding it hard to find a regex that will encode the HTML within a pre tag. For example,

I love cats and dogs but dogs > than cats

<spaces>       <spaces>
<pre>
    1 > 0  = ?

    <code>       dedw</code>
</pre>

should become

I love cats and dogs but dogs > than cats <spaces> <spaces>
<pre>
    1 &gt; 0  = ?

    &lt;code&gt;       dedw&lt;/code&gt;
</pre>

It would be easier if I had a regex that simply ignored everything inside a given tag.

 

I am doing my own research and trying to find the answer, but thought I would clarify what I am trying to do.

 

Thanks if you can help me find a solution
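
The two-step idea above could be sketched in C# roughly as follows. This is only a sketch under assumptions: the names PreEncoder, EncodePreContent and Minify are hypothetical, pre blocks are assumed not to be nested, and only < and > are encoded (an existing & is left alone). It relies on Regex.Replace with a MatchEvaluator so that the encoding is applied only to the text captured between the tags.

```csharp
using System.Text.RegularExpressions;

public static class PreEncoder
{
    // Group 1: opening tag, group 2: content, group 3: closing tag.
    // Lazy .*? keeps each pre block separate; Singleline lets . match newlines.
    private static readonly Regex PreBlock = new Regex(
        @"(<pre\b[^>]*>)(.*?)(</pre\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // The whitespace pattern from the post (step 2).
    private static readonly Regex ExtraWhitespace =
        new Regex(@"(?<=\s)\s+(?![^<>]*</pre>)");

    // Step 1: encode < and > only inside pre blocks.
    public static string EncodePreContent(string html)
    {
        return PreBlock.Replace(html, m =>
            m.Groups[1].Value
            + m.Groups[2].Value.Replace("<", "&lt;").Replace(">", "&gt;")
            + m.Groups[3].Value);
    }

    // Step 2: with no raw brackets left inside pre, the lookahead can protect it.
    public static string Minify(string html)
    {
        return ExtraWhitespace.Replace(EncodePreContent(html), string.Empty);
    }
}
```

Note that this changes the content of the pre blocks (the brackets stay encoded in the output), which may or may not be acceptable for the page in question.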

  • 0

Hi Eric

 

I'm trying to do it in C#.

Basically, I can get what I want if I can strip out whitespace everywhere except between two sets of characters,

e.g.  <pre> anything </pre> or <code> anything </code>

This seems more challenging than I thought it would be.

(?<=\s)\s+(?![^<>]*</pre>) is awesome, but I need it to ignore everything inside a given tag.

My understanding of how to get the right regex syntax is limited.

What I want: match whitespace everywhere except between the tags (pre or code), and remove it.

 

any help with getting this to work would be so awesome.
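
One way to get "ignore everything inside a tag" without a lookahead is to let a single regex match either a whole pre/code block or a redundant whitespace run, and then decide in a MatchEvaluator which of the two was found. Because the block alternative consumes the entire block, the whitespace alternative never gets to look inside it. A minimal sketch, assuming blocks of the same tag are not nested; WhitespaceMinifier and the group names keep and tag are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceMinifier
{
    // Alternative 1 ("keep"): a complete <pre>...</pre> or <code>...</code> block.
    // Alternative 2: whitespace characters that follow another whitespace character.
    private static readonly Regex BlockOrWhitespace = new Regex(
        @"(?<keep><(?<tag>pre|code)\b[^>]*>.*?</\k<tag>\s*>)|(?<=\s)\s+",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string Minify(string html)
    {
        // Protected blocks are written back unchanged; extra whitespace is dropped.
        return BlockOrWhitespace.Replace(
            html,
            m => m.Groups["keep"].Success ? m.Value : string.Empty);
    }
}
```

Stray "<" and ">" characters inside the block do not matter here, since the block is delimited only by its opening tag and the first matching closing tag.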

  • 0

How about the following: http://snipd.net/parsing-xhtml-into-a-dom-tree-in-c

You avoid using regex as you simply parse then print the HTML instead.

 

 

You might also want to re-read my first link, which points out that you cannot parse arbitrary (X)HTML using a regex. It is not possible as regex is not powerful enough to parse (X)HTML. (Parsing a limited subset is possible)

  • 0

To anyone who is good with regex: please help.

(?<=\s)\s+(?![^[<p]+?[e>])

I can't get this to work, but it must be close?

 

To Lant:

Sorry, but I don't need to parse the HTML. I am not trying to validate it or really read it.

I want to remove whitespace, but only outside of a given tag such as <pre> or <code>.

It had been working fine with (?<=\s)\s+(?![^<>]*</pre>) until I hit a new requirement: the <pre> tag can contain < or >,

which this regex does not support. One suggestion was to replace < and > with their HTML entities &lt; and &gt;,

but again I would need something that applies this only to the HTML within the <pre> tag, and I would need to do it in C#.

It seems easier to just get the regex to ignore everything between the tags somehow...

 

Thanks

  • 0

If you don't mind doing the replace in C#, you can use capture groups with this pattern.

(?:\<pre\>)(.*)(?:\<\/pre\>)

It should capture everything between the pre tags, and you can then call string.Replace on the value of the capture. I am working on a Regex replace version so you can use RegexOptions.Compiled. This regex was tested with IgnoreCase and Singleline (the input is multi-line, but Singleline changes how .* is handled so that it also matches newlines).

 

So, for example:

private static readonly Regex preBlockMatch = new Regex(@"(?:\<pre\>)(.*)(?:\<\/pre\>)", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.IgnoreCase);

// html is the input string; Groups[1] holds the text captured between the pre tags
string newValue = preBlockMatch.Match(html).Groups[1].Value.Replace("<", "&lt;");

 

Obviously there is room for some enhancement, but it gets you started.

  • 0

Thought I would share what I got working......

 

Also thanks!!! to Squirrelington for the regex for the between stuff

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class MinifiedStream : MemoryStream
{
    private readonly Stream _output;

    public MinifiedStream(Stream stream)
    {
        _output = stream;
    }

    // [^<>] rather than [^<pre>]: the latter is a character class, not the literal tag.
    private static readonly Regex Whitespace = new Regex(@"(?<=\s)\s+(?![^<>]*</pre>)", RegexOptions.Compiled);

    // Lazy .*? so that each <pre>...</pre> block is matched on its own.
    private static readonly Regex PreBlockMatch = new Regex(@"(?:\<pre\>)(.*?)(?:\<\/pre\>)", RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Decode only the bytes that belong to this call.
        var html = Encoding.UTF8.GetString(buffer, offset, count);

        // Swap every pre block for a unique token so the whitespace regex cannot touch it.
        var blocks = new Dictionary<string, string>();
        html = PreBlockMatch.Replace(html, match =>
        {
            var token = Guid.NewGuid().ToString();
            blocks.Add(token, match.Value);
            return token;
        });

        html = Whitespace.Replace(html, string.Empty);
        html = html.Trim();

        // Put the original pre blocks back.
        foreach (var block in blocks)
        {
            html = html.Replace(block.Key, block.Value);
        }

        // Write the whole minified chunk, starting at index 0 of the new array.
        var bytes = Encoding.UTF8.GetBytes(html);
        _output.Write(bytes, 0, bytes.Length);
    }
}

But it still feels as if it should be possible to do this with regex alone.

If anyone can help me improve it, that would be awesome.

  • 0

I agree that there should be a way to get it exclusively in regex but I am only really good at matching things, I don't have a lot of experience in replacement syntax. I know you can do $1 for the first group, $2 for 2nd or ${name} for named groups but how to modify the contents of those groups in the replacement is beyond me atm. Maybe someone with more experience will be able to chime in. :) Glad it is working thus far though. \o/

 

btw the regex I gave you, I probably went a little too crazy with the non-capture groups. I think at the time I was trying some funky magic to try and capture parts of it for replacement purposes but ended up with that.

<pre>(.*)</pre>

would probably be fine as well. lol
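
A caveat that applies to both forms of the pattern: .* is greedy, so with Singleline a page holding two pre blocks gives a single match running from the first <pre> to the last </pre>. A lazy .*? matches each block on its own. A small sketch that can be used to check this (the class and method names are hypothetical):

```csharp
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Greedy: .* runs to the LAST </pre> in the input.
    public static int CountGreedy(string html)
    {
        return Regex.Matches(html, @"<pre>(.*)</pre>", Options).Count;
    }

    // Lazy: .*? stops at the FIRST </pre> after each <pre>.
    public static int CountLazy(string html)
    {
        return Regex.Matches(html, @"<pre>(.*?)</pre>", Options).Count;
    }
}
```

For the input "<pre>a</pre>   text   <pre>b</pre>", CountGreedy returns 1 and CountLazy returns 2. With the greedy form, anything between the two blocks is swept into the single match as well.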

This topic is now closed to further replies.