Safely Letting Specific HTML Tags Through Sanitization in PHP

How To | April 20th, 2010

Sometimes you want to let your users express themselves and style their input (comments, stories, or whatever else) with a few HTML tags. The trick is doing this without letting through all sorts of bad mojo. There are many ways to do this, some more complicated than others. I’ve devised a fool-proof way to accomplish this. While it can work with any tag syntax (e.g. turning [b] into <b>), in this example I’ll be selectively letting through actual HTML tags rather than aliases. I like to think that by letting users type real HTML tags I might one day help someone who is only semi-literate with computers learn the fundamentals of HTML. Who knows?

As with most mini-tutorials, I’ll start with what doesn’t work.

What doesn’t work

1. strip_tags

strip_tags is a poor fit for a few reasons. First, it ends up removing content, and sometimes more than you would think. That is not the goal of text-field sanitization. Sanitization is meant to prevent mischief while preserving the integrity of the user’s message. Secondly, and more importantly, as the PHP manual warns:

This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.

That is bad for you, bad for me.
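
To make the warning concrete, here is a minimal sketch of the attribute problem. The sample comment string and variable names are made up for illustration; only strip_tags() itself comes from PHP.

```php
<?php
// Hypothetical user comment: <b> is on the allow-list, <i> is not.
$comment = '<b onmouseover="alert(1)">hover me</b> <i>bye</i>';

// Allow only <b>. The <i> tags are removed, but the allowed <b>
// keeps every attribute the user typed, event handler included.
$result = strip_tags($comment, '<b>');

echo $result, "\n";
// prints: <b onmouseover="alert(1)">hover me</b> bye
```

The tag name passes the allow-list check, and the attributes ride along untouched, which is exactly the abuse the manual describes.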

2. preg_replace (or any other regular expression function)

Regular expressions are great for some things, but you are unlikely to write a pattern that accurately does what you want here. Even if you manage it, the result will be overly complicated, time-consuming to write, and hard to modify quickly if you want to allow more tags. Basically, it’s overkill.

So what is a good way to let certain HTML tags through?

The Right Way

The right way is simple and maintains the integrity of the original input by first making a clean sweep, and then going back and choosing what to let through. Here is the code.

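function sanitize_out($input){
  // Escape everything first; the result must be assigned back to $input.
  $input = htmlentities($input);
  $c_p_open=0;
  $c_p_close=0;
  $c_b_open=0;
  $c_b_close=0;
  // Restore the tags that need no closing counterpart.
  $input = str_replace('&lt;br&gt;','<br />',$input);
  $input = str_replace('&lt;hr&gt;','<hr />',$input);
  // Restore paired tags; the 4th argument receives the replacement count.
  // (It is already a by-reference parameter, so no & at the call site.)
  $input = str_replace('&lt;p&gt;','<p>',$input,$c_p_open);
  $input = str_replace('&lt;/p&gt;','</p>',$input,$c_p_close);
  $input = str_replace('&lt;b&gt;','<b>',$input,$c_b_open);
  $input = str_replace('&lt;/b&gt;','</b>',$input,$c_b_close);
  // Append a closing tag for every opening tag left hanging.
  while($c_p_open > $c_p_close){
    $input .= '</p>';
    ++$c_p_close;
  }
  while($c_b_open > $c_b_close){
    $input .= '</b>';
    ++$c_b_close;
  }
  return $input;
}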

Now the code here could be cleaner, but it gets the job done.

What is it doing?

1. Convert all angle brackets and other HTML-significant characters to their corresponding entities with htmlentities().
2. Back-convert specific entities into their HTML counterparts. In this case: <p> </p> <b> </b> <br> and <hr>.
3. Close all hanging tags, so that there will be a </b> for every <b> and a </p> for every <p>.
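
The three steps above can also be sketched in a table-driven form, so that allowing another tag means adding one array entry. This is only an illustration under stated assumptions: the function name sanitize_allowlist and its two array parameters are hypothetical, not part of the original code, and accepting the <br/> and <br /> spellings is an addition of this sketch.

```php
<?php
// Hypothetical helper; the name and parameter layout are illustrative only.
function sanitize_allowlist(string $input, array $paired = ['p', 'b'], array $void = ['br', 'hr']): string
{
    // Step 1: escape everything.
    $out = htmlentities($input, ENT_QUOTES, 'UTF-8');

    // Step 2a: restore tags with no closing counterpart,
    // accepting the <br>, <br/> and <br /> spellings.
    foreach ($void as $tag) {
        $out = str_replace(
            ["&lt;$tag&gt;", "&lt;$tag/&gt;", "&lt;$tag /&gt;"],
            "<$tag />",
            $out
        );
    }

    // Step 2b: restore paired tags, counting opens and closes.
    foreach ($paired as $tag) {
        $out = str_replace("&lt;$tag&gt;", "<$tag>", $out, $opens);
        $out = str_replace("&lt;/$tag&gt;", "</$tag>", $out, $closes);

        // Step 3: append one closing tag per hanging opening tag.
        if ($opens > $closes) {
            $out .= str_repeat("</$tag>", $opens - $closes);
        }
    }
    return $out;
}

echo sanitize_allowlist('<b>Hi<br/><script>alert(1)</script>'), "\n";
// prints: <b>Hi<br />&lt;script&gt;alert(1)&lt;/script&gt;</b>
```

Like the original function, this sketch only appends missing closing tags at the end of the string. It does not check nesting order, and it leaves stray closing tags that have no opener in place.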

Next time you want to do this type of input formatting, do it this way rather than reaching for overly complicated ereg or preg patterns.

2 Responses to “Safely Letting Specific HTML Tags Through Sanitization in PHP”

  1. SeanJA says:

    You might be better off using regex for the br and hr replace, a lot of people will close their tags properly and that will mess things up for you.

  2. admin says:

    That’s possible, Sean. Though you could also argue that if you are using this for a forum or something similar, you would have a “key” listing the specific valid codes, with buttons to insert those codes into the textarea. I will show how to do this in a follow-up post.

    Also, out of curiosity I did a speed test for variations of preg_replace and str_replace…

    100000 iterations, using 5 paragraphs input from lorem ipsum with 5 ‘<br>’s randomly about:
    .306 seconds
    $t=str_replace('&lt;br&gt;','&lt;br /&gt;',$input);

    .793 seconds
    $t=preg_replace('/&lt;br&gt;/','&lt;br /&gt;',$input);

    .594 seconds
    $t=str_replace('&lt;br&gt;','&lt;br /&gt;',$input);
    $t=str_replace('&lt;br /&gt;','&lt;br /&gt;',$t);


    .852 seconds
    $t=str_replace('&lt;br&gt;','&lt;br /&gt;',$input);
    $t=str_replace('&lt;br /&gt;','&lt;br /&gt;',$t);
    $t=str_replace('&lt;br/&gt;','&lt;br /&gt;',$t);


    .880 seconds
    $t=preg_replace('/&lt;br(\s){0,1}(\/){0,1}&gt;/','&lt;br /&gt;',$input);

    The time difference is really microscopic, but writing regexes isn’t as easy as writing str_replace calls.

    I will say preg_replace WORKS, but I like to avoid regex when something else simple works just as well.
