Validating input is the number one protection mechanism and can prevent a plethora of attacks. But what exactly is input validation? How and where should it be applied? Are there pitfalls, or will any input validation technique work? These are all valid questions, and the answers directly affect the security of your application.
Although the object being validated is most often the input, there are cases where an input validation strategy falls short. For those cases there is another strategy, output validation, which is equally important and sometimes more important than input validation. It is, however, usually classified under the term input validation as well.
Note: Most of the time (99%) validation should take place on the server side. This is critical information that every developer should know: client-side validation can be bypassed.
There are four main input validation techniques: normalization, whitelisting, blacklisting and encoding. Let's look at each of them in some detail.
Also known as canonicalization, normalization can be described as the transformation that turns the input into its simplest form. It is quite a complex technique, because the simplest form means something different for each medium.
For example, let's assume the attacker sends an input such as the one below to our software end-point, which in turn executes a file I/O operation.
If we apply file system path normalization to this input before executing any I/O operation (see FileInfo.Name), we get:
The input and the normalized output differ, and this tells us, the developers, that an attack is probably under way.
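The comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's `posixpath` module as a stand-in for the .NET `FileInfo.Name` call mentioned in the text, the function name `is_plain_file_name` is hypothetical, and the sample paths are made-up traversal attempts rather than the article's own example.

```python
import posixpath


def is_plain_file_name(user_input: str) -> bool:
    """Return True only if the input is already in its simplest form:
    a bare file name with no directory components."""
    # Treat backslashes as separators too, so Windows-style paths are caught.
    unified = user_input.replace("\\", "/")
    # Collapse "." and ".." segments, then keep only the last path component.
    normalized = posixpath.basename(posixpath.normpath(unified))
    # If normalization changed anything, the input was not a plain file name.
    return normalized == user_input and normalized not in ("", ".", "..")


if __name__ == "__main__":
    for candidate in ("report.pdf", "../../etc/passwd", "..\\..\\windows\\win.ini"):
        print(candidate, is_plain_file_name(candidate))
```

A mismatch between input and normalized output does not prove malice, but it is a cheap signal that the request deserves rejection or at least logging.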
Whitelisting means accepting only good input, where good means expected. For example, if there is a textbox that should take a credit card number from the user, the backend should check the submitted value against the Luhn algorithm. Why? Because credit card numbers must comply with this checksum formula.
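As a rough sketch of such a whitelist check, the Luhn checksum can be implemented as follows. The function name `passes_luhn` and the accepted length range of 12 to 19 digits are assumptions made for this example, and a real check would usually be combined with issuer-specific rules.

```python
def passes_luhn(card_number: str) -> bool:
    """Whitelist check: accept only digit strings that satisfy the Luhn checksum."""
    # Allow the common visual separators, then require ASCII digits only.
    digits = card_number.replace(" ", "").replace("-", "")
    if not (digits.isascii() and digits.isdigit()):
        return False
    # Assumed length range for this sketch.
    if not 12 <= len(digits) <= 19:
        return False

    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as summing the two digits of the product
        total += value
    return total % 10 == 0


if __name__ == "__main__":
    print(passes_luhn("4111 1111 1111 1111"))  # a widely used test number
    print(passes_luhn("4111 1111 1111 1112"))
```

Everything that is not a well-formed card number, including any injection payload, fails this check without the code ever having to know what an attack looks like.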
Whitelisting should be the number one protection mechanism, and every input should be checked against it.
Although it seems easy to choose whitelisting and employ it consistently all over your application, doing so is a tedious process that needs discipline. Never underestimate human laziness and hackers' diligence. Moreover, whitelisting falls short in some situations.
Blacklisting, as opposed to whitelisting, means rejecting known bad input. For example, the character most heavily used by hackers is the single quotation mark ('). Why? Because of the infamous SQL Injection weakness. Knowing this, we developers will most likely reject any input containing the ' character when implementing a protection mechanism. That is rejecting a baaad input symbol.
Is this approach right? Well, not always. This attitude toward input validation falls short in a number of cases. Firstly, certain SQL Injection cases can be exploited without using a single quotation mark. And what happens when a business owner comes along and says, "One of our customers can't register on our site because of their surname, O'Neal... What gives?" And we all know the answer...
In this case, loosening the blacklist will open a can of worms, too.
Blacklisting should be avoided at all costs. Although it seems pretty easy to put in place, it is really hard to maintain and, most of the time, insecure.
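Both failure modes can be seen in a tiny sketch. The function `naive_blacklist_allows` is hypothetical and deliberately simplistic, and the payload `1 OR 1=1` assumes a query that concatenates the value into a numeric, unquoted position.

```python
def naive_blacklist_allows(value: str) -> bool:
    """Reject any input that contains a single quotation mark."""
    return "'" not in value


if __name__ == "__main__":
    # False negative for the business: a legitimate surname is rejected.
    print(naive_blacklist_allows("O'Neal"))    # False
    # False positive for security: a quote-free payload is accepted.
    print(naive_blacklist_allows("1 OR 1=1"))  # True
```

The list blocks a real customer while still letting an attack through, which is why growing or loosening it rarely ends well.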
So whitelisting is king, you might think... Well, not so fast. Whitelisting is not the answer for every situation. The striking example was mentioned in the O'Neal part of the blacklisting discussion: how would you solve the O'Neal problem when you want to avoid SQL Injection but also accept the single quotation mark?
That certainly can't be done with whitelisting alone, since accepting ' with no extra protection mechanism would allow hackers to abuse it. For this particular example, one solution is to eradicate the meta behaviour of the single quotation mark. What does that mean? If we put another single quotation mark in front of the one in the input, O'Neal becomes O''Neal. The SQL query parser will then no longer treat the original single quotation mark as the end of the string and will process it as literal data, preventing the hacker from manipulating the SQL query.
At its heart, this is called encoding: removing the meta meaning of symbols by transforming them.
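To make the quote-doubling idea concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module purely as a convenient SQL parser. The helper names `escape_sql_string_literal` and `round_trip` are made up for this example, and building a query by string concatenation is done here only to demonstrate the encoding, not as a recommended practice.

```python
import sqlite3


def escape_sql_string_literal(value: str) -> str:
    """Double every single quote so it is read as data, not as a string terminator."""
    return value.replace("'", "''")


def round_trip(value: str) -> str:
    """Embed the escaped value in a string literal and let SQLite parse it back."""
    conn = sqlite3.connect(":memory:")
    try:
        # Demonstration only: the literal is assembled by hand on purpose.
        query = "SELECT '" + escape_sql_string_literal(value) + "'"
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()


if __name__ == "__main__":
    print(escape_sql_string_literal("O'Neal"))  # O''Neal
    print(round_trip("O'Neal"))                 # O'Neal
```

The database hands back exactly the text that went in, which shows that the quotation mark was treated as data rather than as syntax.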
Of course, this solution (doubling the single quotes) is not a panacea, but that is another story. Ultimately, the best protection against SQL Injection is using prepared statements and their synonyms, such as parameterized queries.
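For comparison, a prepared-statement version of the O'Neal case might look like the sketch below. It assumes Python's built-in `sqlite3` module and a hypothetical `customers` table. Other databases and drivers use different placeholder syntax, but the principle of sending the query text and the data separately is the same.

```python
import sqlite3


def create_database() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (surname TEXT)")
    return conn


def add_customer(conn: sqlite3.Connection, surname: str) -> None:
    # The ? placeholder keeps the value out of the SQL text entirely.
    conn.execute("INSERT INTO customers (surname) VALUES (?)", (surname,))


def find_customers(conn: sqlite3.Connection, surname: str) -> list:
    rows = conn.execute(
        "SELECT surname FROM customers WHERE surname = ?", (surname,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = create_database()
    add_customer(db, "O'Neal")
    print(find_customers(db, "O'Neal"))       # the stored surname is found
    print(find_customers(db, "' OR '1'='1"))  # the payload matches nothing
```

No escaping code is needed here, because the driver never interprets the bound value as part of the query.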
So, here's our take on an input validation strategy flow. On the server side, the input should be tested against whitelists before it is let any further into the internals of the application. If you are really keen on applying blacklisting, it should come before whitelisting, because it can only be relied on as a second defence mechanism.
One more thing: keeping blacklisting simple is very important (and so is keeping whitelisting simple). Therefore normalization, such as recursive URL and HTML decoding, should be applied before any blacklisting.
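Recursive decoding can be sketched as a loop that keeps decoding until the value stops changing. The function name `fully_decode` and the limit of five rounds are assumptions chosen for illustration. The limit exists so that deeply nested encodings are rejected instead of being processed indefinitely.

```python
import html
from urllib.parse import unquote


def fully_decode(value: str, max_rounds: int = 5) -> str:
    """Apply URL and HTML decoding repeatedly until the value is stable."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(value))
        if decoded == value:
            return decoded  # nothing left to decode
        value = decoded
    # Still changing after max_rounds passes: treat it as suspicious.
    raise ValueError("input is still encoded after the allowed decoding rounds")


if __name__ == "__main__":
    # Double URL encoding hides the angle brackets from a single decoding pass.
    print(fully_decode("%253Cscript%253E"))  # <script>
```

With the input reduced to its plain form first, the blacklist and whitelist only need to know about `<` itself and not about every way it can be encoded.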
Normalization -> Blacklisting -> Whitelisting -> Encoding
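Put together, the flow might look like the following sketch for a surname field. Everything specific here is an assumption made for illustration: the function names, the blacklist tokens, the surname pattern and the 50-character limit. The final step uses quote doubling only because that is the encoding discussed above, and a parameterized query would normally replace it.

```python
import html
import re
from urllib.parse import unquote

# Hypothetical rules for a surname field.
BLACKLIST = ("--", ";", "/*")
SURNAME_WHITELIST = re.compile(r"[A-Za-z][A-Za-z' -]{0,49}")


def normalize(value: str, max_rounds: int = 5) -> str:
    """Step 1: decode URL and HTML encodings until the value is stable."""
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(value))
        if decoded == value:
            return decoded.strip()
        value = decoded
    raise ValueError("too many layers of encoding")


def process_surname(raw: str) -> str:
    value = normalize(raw)
    # Step 2: blacklisting, only as a secondary defence.
    if any(token in value for token in BLACKLIST):
        raise ValueError("input contains a known-bad token")
    # Step 3: whitelisting, the main gate.
    if not SURNAME_WHITELIST.fullmatch(value):
        raise ValueError("input is not an expected surname")
    # Step 4: encoding for the SQL string context.
    return value.replace("'", "''")


if __name__ == "__main__":
    print(process_surname("O%27Neal"))  # O''Neal
```

Each stage stays small because the stage before it has already narrowed down what the input can look like.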