Secure Development Tip of the Week

Application and Cyber Security Blog:

a Security Innovation Blog covering software engineering, cybersecurity, and application risk management

Input Validation using Regular Expressions

Input validation is your first line of defense when creating a secure application, but it's often done insufficiently, done in a place that is easy to bypass, or simply not done at all. Since this is a common issue I see in our assessments, and one that has a great impact on security, I'd like to spend a bit of time outlining input validation best practices and give you some concrete examples of how to do it well.

Input validation is the practice of limiting the data that is processed by your application to the subset that you know you can handle. This means going beyond simple data types and understanding the ideal data type, range, format and length for each piece of data. One example is a phone number, which could be stored as a string in memory and a varchar in the database. However, we know much more about the context of that phone number, and we can use that knowledge to limit our attack surface by verifying the validity of the input. If you know that a phone number's format is numeric and its length is 10 digits, you can quickly see that abc123Fmasdf9$1< is not a valid phone number, even though it can be stored as a string or in the database.

Whitelist or Blacklist?

The first concept of good input validation is whitelisting versus blacklisting. Whitelist, or inclusive, validation defines a set of valid characters, while blacklist, or exclusive, validation defines a set of invalid characters to try to remove.

If we attempt to perform input validation using blacklisting, we will try to enumerate each character that we know is bad. Easy ones that come to mind might be <, >, ', -, %, etc. This can be very challenging; we need to understand every context, every attack and every encoding to be successful. In addition to context, we must be able to anticipate all future attacks and bad values. This technique is nearly impossible to get right.

If we whitelist a set of characters that we know we can handle, the task of validation is much easier. Take the phone number example from above; I've never seen a phone number that includes any characters other than the following: 0123456789()-+,. and space. Therefore we can quickly reject the example from the second paragraph because it contains characters that are not in this list.
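As a rough illustration of the whitelist idea, here is a minimal Python sketch. Python is my choice for the example, and the names ALLOWED_PHONE_CHARS and uses_only_allowed_chars are made up for it. It only checks that every character comes from the list above; it does not yet check format or length.

```python
# Characters listed above as plausible in a phone number (digits, punctuation, space).
ALLOWED_PHONE_CHARS = set("0123456789()-+,. ")


def uses_only_allowed_chars(value: str) -> bool:
    """Return True when value is non-empty and every character is whitelisted."""
    return bool(value) and all(ch in ALLOWED_PHONE_CHARS for ch in value)


print(uses_only_allowed_chars("(206) 555-0100"))    # True
print(uses_only_allowed_chars("abc123Fmasdf9$1<"))  # False
```

Note that nothing here tries to name the bad characters. Anything that is not explicitly allowed is rejected, including characters nobody thought to worry about.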

Enter: The Regular Expression

A great way of defining a whitelist for input validation is to leverage Regular Expressions. Regular Expressions are incredibly powerful and can be a bit daunting at first, but once you get the hang of them you'll use them nearly every day. I know I do.
There are many great resources on the web for learning Regular Expressions, which I'll list at the bottom of this post, so I don't want to spend time explaining how they work or their specific ins and outs. Instead, I'd like to walk through my process of developing a restrictive whitelist regular expression for a common example. At the bottom of the post I'll give a few extras with less explanation. I recommend you not take my word for these regular expressions, but spend a bit of time understanding how they work and what they'll do.

To help you test regular expressions I've written a simple regular expression matcher in .NET, aptly named "RegexMatcher". It is free and open source, and available on GitHub. Simply type your regular expression into the top text box and the text you wish to match into the lower text box. Your matches will show up in the box to the right.

Download Regex Matcher

Example – Usernames

We can define usernames to be as restrictive as we'd like, but let's start with something simple: "The username must contain only upper and lowercase letters."

Therefore the following list of usernames is valid:

  • Joe
  • a
  • thisisaverylongusernameindeed

These are not:

  • Mr.Smith
  • Two Words
  • S4MMIE


First Pass

Starting with a simple regular expression we might come up with something like: 

^\w+$

This will allow one or more of any "word" character, which includes numbers, letters and underscores, so S4MMIE slips through. The caret (^) marks the beginning of the string and the dollar sign ($) marks the end of the string. These are good to keep in; without them our regular expression may match part of the input while allowing additional, unvalidated data through. As you can see, this is too liberal for our uses.
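To see the first pass in action, here is a small Python sketch. The language choice and the helper name is_valid_first_pass are assumptions for illustration. It uses re.fullmatch so the whole input has to match, and the last two lines show what can happen when the anchors are left out.

```python
import re

# The first-pass pattern from above. In Python 3, \w also accepts non-ASCII
# letters unless re.ASCII is passed, which makes it more liberal still.
FIRST_PASS = re.compile(r"^\w+$")


def is_valid_first_pass(username: str) -> bool:
    """Return True when the entire username consists of word characters."""
    return FIRST_PASS.fullmatch(username) is not None


for candidate in ["Joe", "S4MMIE", "Mr.Smith", "Two Words"]:
    print(candidate, is_valid_first_pass(candidate))

# Without anchors, a search succeeds on a fragment and the rest goes unchecked.
UNANCHORED = re.compile(r"\w+")
print(UNANCHORED.search("Two Words<script>") is not None)  # True
```

S4MMIE passes even though the requirement was letters only, and the unanchored search reports a match on input that clearly should be rejected.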

Get More Restrictive

We can define a specific list of allowed characters by placing them in square brackets, which creates an inclusive character set. This regular expression will match one or more (via the plus sign) upper or lowercase letters (a-z or A-Z).

^[a-zA-Z]+$

There we go, that matches only the usernames that we want. 
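A quick way to check this against the two lists of usernames is a short Python sketch like the one below. The helper name is_valid_username is hypothetical.

```python
import re

# Letters-only whitelist from above.
LETTERS_ONLY = re.compile(r"^[a-zA-Z]+$")


def is_valid_username(username: str) -> bool:
    """Return True when the entire username is made of ASCII letters."""
    return LETTERS_ONLY.fullmatch(username) is not None


valid = ["Joe", "a", "thisisaverylongusernameindeed"]
invalid = ["Mr.Smith", "Two Words", "S4MMIE"]

print(all(is_valid_username(name) for name in valid))    # True
print(any(is_valid_username(name) for name in invalid))  # False
```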

New Requirements

What if later there is a business requirement to allow numbers, dashes and dots in usernames? We can easily add those to the whitelist like so:

^[a-zA-Z0-9.-]+$

Now we can see that S4MMIE, user-name, Mr.Smith and Joe.Basirico all get through. 

If we continue to take this approach we can clearly see each inclusive decision and easily see which characters will make it through, and which will not.
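One way to keep each inclusive decision visible in code is to build the character class from commented pieces. This is only a sketch; the names ALLOWED_PIECES and is_valid_username are invented for the example, and the resulting pattern is the same as the one shown above.

```python
import re

# Each inclusive decision gets its own commented entry (names are illustrative).
ALLOWED_PIECES = [
    "a-z",  # lowercase letters
    "A-Z",  # uppercase letters
    "0-9",  # digits, added for the new requirement
    ".",    # dot, which is literal inside a character class
    "-",    # dash, kept last so it is read literally rather than as a range
]
USERNAME = re.compile("^[" + "".join(ALLOWED_PIECES) + "]+$")


def is_valid_username(username: str) -> bool:
    """Return True when the entire username uses only whitelisted characters."""
    return USERNAME.fullmatch(username) is not None


print(USERNAME.pattern)  # ^[a-zA-Z0-9.-]+$
for candidate in ["S4MMIE", "user-name", "Mr.Smith", "Joe.Basirico", "Two Words"]:
    print(candidate, is_valid_username(candidate))
```

Adding or removing an allowed character becomes a one-line change that is easy to review.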

Other Examples

Phone Numbers

Phone numbers can be difficult if you start getting into international numbers and complicated formats. I like to strip out everything but the digits, then make a very quick check to see if there are 10 digits.

^\d{10}$
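The strip-then-check approach might look like this in Python. The function name normalize_phone is hypothetical, and passing re.ASCII is my addition so that \d means only 0-9 rather than any Unicode digit.

```python
import re
from typing import Optional

NON_DIGITS = re.compile(r"\D", re.ASCII)          # anything that is not 0-9
TEN_DIGITS = re.compile(r"^\d{10}$", re.ASCII)    # the check from above


def normalize_phone(raw: str) -> Optional[str]:
    """Strip non-digits and return the 10-digit number, or None if invalid."""
    digits = NON_DIGITS.sub("", raw)
    return digits if TEN_DIGITS.fullmatch(digits) else None


print(normalize_phone("(206) 555-0100"))    # 2065550100
print(normalize_phone("555-0100"))          # None
print(normalize_phone("abc123Fmasdf9$1<"))  # None
```

Returning the stripped digits, rather than the raw input, means the rest of the application only ever sees the cleaned value.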

Otherwise a slightly longer regular expression will do: 

^1?[\(\- ]*\d{3}[\)\-\. ]*\d{3}[\-\. ]*\d{4}$
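Here is a hedged Python sketch of the longer check. In this version every hyphen inside a character class is escaped, so it is always read as a literal dash and never as a range between its neighbours. The name is_us_phone and the re.ASCII flag are my additions.

```python
import re

# Optional leading 1, then 3-3-4 digits with optional (, ), -, . or space
# separators. Hyphens are escaped so none of them can form a range.
LONG_PHONE = re.compile(
    r"^1?[\(\- ]*\d{3}[\)\-\. ]*\d{3}[\-\. ]*\d{4}$",
    re.ASCII,
)


def is_us_phone(value: str) -> bool:
    """Return True when the whole value looks like a formatted 10-digit number."""
    return LONG_PHONE.fullmatch(value) is not None


for candidate in [
    "(206) 555-0100",
    "1-206-555-0100",
    "206.555.0100",
    "555-0100",
    "206*555-0100",
]:
    print(candidate, is_us_phone(candidate))
```

The first three candidates are accepted. The last two are rejected: one has too few digits and the other uses a separator that is not on the whitelist.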

E-mail Addresses

E-mail addresses are notoriously difficult to match if you get too caught up in the RFC. Additionally, if you try to be too compliant you may open yourself up to other issues, such as command injection, SQL injection or Cross Site Scripting. I suggest striking a balance between readability and restriction, such as:

^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$

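This will match the majority of e-mail addresses, but it will reject any top-level domain longer than four letters, such as .museum, along with some very fringe e-mail addresses. Because the character classes list only uppercase letters, the pattern needs to be applied with case-insensitive matching. Consult your business requirements to see if something more complicated is required.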
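A short Python sketch of this check follows. The helper name is_plausible_email is made up for the example, and the re.IGNORECASE flag is an assumption on my part: the character classes in the pattern list only A-Z, so without it lowercase addresses would not match.

```python
import re

# The e-mail pattern from above, applied case-insensitively.
EMAIL = re.compile(r"^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", re.IGNORECASE)


def is_plausible_email(value: str) -> bool:
    """Return True when the whole value fits the restrictive e-mail pattern."""
    return EMAIL.fullmatch(value) is not None


for candidate in [
    "joe@example.com",
    "Joe.Basirico@mail.example.org",
    "curator@example.museum",   # TLD longer than four letters
    "joe@@example.com",
    "<script>@example.com",
]:
    print(candidate, is_plausible_email(candidate))
```

The first two addresses are accepted and the rest are rejected, including the .museum address, which is the trade-off this pattern makes.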

More Resources

There are some really great resources out there to find examples of regular expressions and to learn how they work. I highly suggest you learn this incredibly powerful piece of computer science.

See the following articles and websites for more information on regular expressions.
