Explain like I’m 5: SQL injection and parameterized queries / prepared statements

Recently, I was talking with a friend and they brought up the subject of my job. After explaining that I get to hack into companies’ websites, networks and whatnot, they asked the Inevitable Question that tends to follow in such a conversation:

“So, how do you actually hack a website?”

Usually, I can get away with the high level and abstract “well, by using little mistakes and oversights in the website’s code or configuration to make it do stuff it wasn’t meant to do”, but this time it wasn’t enough. Their interest was piqued and they wanted some juicy details on how one would do such a thing. What sort of mistakes could there be and how could you use them for your own purposes? So I figured I would explain something relatively simple like SQL injection. One little challenge though: unlike most people who ask for gritty details, this person was completely non-tech savvy and had never heard of HTML, let alone SQL. Well..

So, here it is: an explanation of both SQL Injection and its preferred solution Parameterized Queries, aimed at people who have never even heard of SQL before:

Part 1: Where does my webpage come from?

The first part in understanding SQL Injection is understanding the rough basics of what happens when you go to a website.

When you tell your browser to go to a page on a website, your browser makes a connection to the computer that runs the website and kindly asks it for a specific page. We’ll call this computer the web server:

At this moment, there are two possible ways in which the web server can get the content that your browser is asking for:
It can have an actual file that is literally called “page number 9” and send it to the browser.
It can run some code, based on the input “page number 9”, to magically get the content of the page from some other source.

For option 2, there are lots and lots and lots of different ways in which a web server can go from “he wants page number 9” to “ah, this is the content I have to send back”. Often, these ways include talking to a database to do some user authentication, retrieve content, etc.

As you might guess, all of these additional actions can make the simple act of retrieving a webpage into quite a complex procedure. And the more complex the procedure, the higher the chance that something, somewhere contains a small mistake or an slight oversight. It is here, that most of the (website) hacking takes place, and it is here that SQL injection can come into play.

Part 2: SQL and SQL injection

For now, let’s assume that the web site is fairly simple and the only thing it has to do when it gets a request for a page is ask its database for the content of the page and pass it along to the browser.

As you can see, in this scenario the web server basically just copies the question that was sent by the browser and forwards it to the database.

Now, if this were to be implemented in a very simple way, all that the web server would do is retrieve the page number from the request, copy it into his database question, send the question to the database and return whatever the database sends back. So, we could imagine that on the web server there is a bit of code that goes something like this:

   1: page_number = get_page_number_from_browser_request()

   2: the_question = “Database, what is the content of page number ”

   3:                 + page_number

   4: the_result = send_question_to_database(the_question)

   5: send_response_to_browser(the_result)

Although the above piece of (pseudo) code gets the job done, it doesn’t pay much (or any, for that matter) attention to what it copies into the_question, and therefore to what it is asking the database. This, as it turns out, can be quite dangerous.

Suppose that the website in question has a special section for administrators, where they can manage the website’s content and maybe even manage the web server on which the website runs. Also suppose that the secret login credentials for this admin section are saved in the database. Now what happens when an evil browser comes along and sends the following request:

According to code, all that the web server does is copy the browsers value for the page number and use it in its question to the database server. This means the question to the database would now read:

“Database, what is the content of page number 9 and the login credentials for the admin section?”

Now, databases are designed to answer pretty much any type of question that a web server is willing to throw at them (assuming that the web server is allowed to know the answer), so a question for both some website content and some login credentials is happily answered with just that.

What effectively happened here is that our evil Hacker was able to change the format of the question that the web server sends to the database by injecting some of his own content.
Instead of a question of the format:
”what is the content of page number <some_number>”,
it became a question of the format
”what is the content of page number <some_number> and <sensitive data that the hacker wants>”,

The language in which web servers talk to databases is called Structured Query Language – or SQL. The attack described above, where an attacker is able to change the query that is sent to the database, injecting his own SQL statements, is called SQL injection.

Depending on the website and how it uses input (from browsers, users, or whatnot), an Evil Hacker can have many ways of influencing SQL queries and equally many ways of leveraging a SQL injection vulnerability to take over an entire website. Instead of going into details of how the various methods of exploitation work, let’s just assume that SQL injection is by definition a BAD THING and continue onto the next part.

Part 3: defending against SQL injection: the traditional way

Now let’s look a little closer at the way the web server and the database talk and how the communication leads to results from the database.

Traditionally, communication between a website and a SQL database takes the following (rough) form:

The website sends it’s question in human readable SQL to the database software
The database software analyses the human readable SQL and translates it into a list of specific actions that the database software has to perform to retrieve the information
The database software performs all the actions that are in the list
The database software sends the results back to the website

So, looking at our simple web page, the retrieval of page number 9 would look something like this

Now, what happens with a SQL injection, is that the translation step that is done by the database results into something different than what the website is expecting.

In our Evil Hacker example, the website assumed that the database would do something like

Open table “pages”
Find the row where column “number” equals “9 and the login credentials”
Retrieve the content of column “content”

If the database followed these steps, it wouldn’t have found a valid row and it would just have returned nothing. Instead, the database saw the query and translated it into something resembling

Open table “pages”
Find the row where column “number” equals “9”
Retrieve the content of column “content”
Open table “credentials”
Retrieve everything where section equals “admin”

And the attacker was presented with the login credentials for the admin system.

There are two ways in which this kind of situation can be prevented. One of these ways keeps the communication with the database in pretty much the same format as it is in the picture. The other changes the way that the website and database talk to each other a little bit.

The way that does not change the communication is by being very careful in the way that the website constructs the SQL queries. SQL has very clear rules about how it interprets its queries and how you can put different types of values into queries correctly. For instance, if you want to search for a piece of text, then the piece of text should be enclosed with (single or double) quote characters and any quote characters within the text should be preceded with a special character so the database software knows not to interpret it as the end of the piece of text.

Example: Database, what is the content of page number “9 and the login credentials for the admin section"?

If the website checks that all input that is used to make up a query conforms to the rules before sending the query, it can make sure that the database will interpret it in the way the website expects it to. The only problem here is that it is incredibly easy to forget a check or to implement a check incorrectly, and it only takes a single mistake for a hacker to gain access to your system. So even though it’s possible to do it correctly this way, the general advice is to abandon this approach of database communication and instead switch over to…

Part 4: the preferred solution, Parameterized Queries / Prepared Statements!

The second way to prevent misunderstandings is by splitting the communication between the website and the database into two parts:

A part where the website tells the database the general structure of his question, with placeholders instead of variable values.
A part where the website tells the database with which values to fill in the blanks

In our example, the website already knows beforehand that it wants the database to retrieve the content of a specific row from the pages table, based on whether the the “number” column matches some value, so it could send this information to the database, allow the database to start building a list of appropriate actions and then send the actual values so the database can carry the actions out:

Now if we were to have an evil browser, the following would happen:

What happened here is that the format of the query – or the functionality that the application offers – is now completely separated from the data that an Evil Hacker can send.

You could say that the database was allowed to prepare the query by turning it into a list of actions before exposing it to (potentially dangerous) data. This is where we get the term Prepared Statement from.

Similarly, you could say that format of the query was changed in such a way that it resembles a function that it treats the (potentially dangerous) data strictly as parameters. This is where we get the term Parameterized Query from.

Basically, the two names can be used pretty much interchangeably,although Prepared Statement is used more often when referring to the prepared list of actions (at the database side) and Parameterized Query is used more often when referring to the human readable form of the query as it is written in the website’s code.

Regardless of what you decide to call them, though, it should be clear that this method completely renders SQL injection methods as we known them useless :)

Caveats

Although you may now think that every problem in the (SQL) world can be solved by using Parameterized Queries / Prepared Statements, the observant reader might recall that in my previous blog post I mentioned SQL injection problems will likely persist, even in parameterized applications.

The reason for this seeming contradiction is twofold. The first reason is that, sadly, not all types of actions that SQL can do can be parameterized, even if intuitively you would assume they could.

For instance, even though you are able to say that the value of column “number” will be sent as a parameter:
“what is the content of page number <parameter_1>”
you cannot say that you will send the column that is to be read as a parameter too:
”what is the <parameter_1> of page number <parameter_2>.

This is because the SQL action “retrieve the content of column…” has to be filled completely at the time that the list of actions is created. Current SQL databases just do not support the ability to leave a blank there and fill it in later. Similarly, you cannot specify which tables to read or on which columns to sort through parameters.

The second reason is that parameterization only works if you know what format your question has to be beforehand, while many applications want to allow the structure of the query to be determined through user input. This is the problem that my previous blog post talks about.

Although these issues can be solved in a safe way, they are not something that can not be covered purely by parameterization. For more information on what can go wrong, read my previous blog post :)

For now, I just hope you’ve gained a bit of an intuition about database communication and how parameterized queries can help in securing your website.

Also, now that you’ve come this far, enjoy the mandatory XKCD comic about SQL injection, you’ve earned it!