Wednesday, December 17, 2014

Hiring, House Hunting, and Dating: Making Decisions with Optimal Stopping Theory

There are so many important searches in life: jobs, homes, girlfriends or boyfriends, which college to go to…if only there were a mathematical formula to help us make important decisions.  Oh wait, there is!  It’s called the optimal stopping theory.  While it certainly won’t help you make the best decision all of the time, it can provide guidance.

The Secretary Problem

One example of how optimal stopping theory is applied is the secretary problem.  The secretary problem is a hypothetical situation in which you are searching for the perfect job candidate to fill a secretarial position.  In this scenario, there are a few assumptions made to keep the problem simple:

  • The number of candidates, n, is known.
  • The candidates are ranked best to worst, with no ties, but the ranks are unknown.
  • Only the best candidate counts as a success.  Since true ranks are unknown, you must judge each candidate against the ones already seen.
  • The candidates are interviewed in a random order, and the decision to keep or reject is made immediately after the interview, before proceeding to the next.
  • Once a candidate is rejected, they can never be recalled.
  • Once a candidate is chosen, no more are interviewed.

If you were to pick by trial and error, you might notice that you can usually get pretty close to finding the best candidate by automatically rejecting a few, then picking the next one you find that is better than all the ones you’ve seen until then.  Unless you are unlucky enough to have the best one fall into the first few that you reject, this strategy usually works.  In fact, that is the general idea behind solving the secretary problem.

If you pick the kth candidate, then the minimum number of candidates you pass over is k-1 (because if you pick the first one, you aren’t passing over any).  Your choices are k, k+1, k+2, and so on, up to the number of candidates, n.  Your goal is to maximize the probability of choosing the best candidate, which you could write as P(S_{n,k}).

As a good start to solving this problem, you could look at a situation when n = 3.  The candidate ranks are therefore 1, 2, and 3.  Writing out the permutations when n = 3, it becomes clear that the optimal strategy is to pick k = 2.

1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1

Consider each row in the table above as a different scenario.  In the first row, the best candidate (the candidate with rank 1) shows up first.  If you always pick the first candidate (k = 1), your odds of getting the best candidate are 2/6.  If you reject the first candidate and then pick the next one who is better than everyone seen so far (k = 2), your odds rise to 3/6.  If you reject the first two candidates and then pick the next one who beats everyone seen so far (k = 3), your odds fall back to 2/6.  So as I mentioned, the best strategy is to pick when k = 2.

You could write out the 24 permutations for n = 4, the 120 for n = 5, and so on, finding the optimal kth choice for each number of candidates, but it would quickly become hard to do manually.  Let x be the position at which the best candidate appears.  An equation for the probability of success, P(S_{n,k}) (success meaning you picked the best candidate), is the piecewise function:

P(S_{n,k}) = 1/n, when k = 1
P(S_{n,k}) = ((k-1)/n) * sum from x = k to n of 1/(x-1), when k > 1

Reading through that: the probability of success is 1/n when you pick the first candidate, and (k-1)/n times the sum of 1/(x-1) from x = k to n when you pick any candidate from the second to the last.  You can test this equation against the case when n = 3; you will get the same probabilities that you manually calculated before.  A proof of how it is derived can be found here.
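As a quick check (my own sketch, not code from the original post), the piecewise formula can be evaluated exactly using Python’s fractions module; for n = 3 it reproduces the 2/6, 3/6, and 2/6 odds from the permutation table:

```python
from fractions import Fraction

def p_success(n, k):
    """Probability of picking the best of n candidates when the first
    k-1 are rejected and the next record-setter is chosen."""
    if k == 1:
        return Fraction(1, n)
    return Fraction(k - 1, n) * sum(Fraction(1, x - 1) for x in range(k, n + 1))

for k in (1, 2, 3):
    print(f"n=3, k={k}: {p_success(3, k)}")
```

Running it prints 1/3, 1/2, and 1/3, matching the manual count, and evaluating larger n (for example p_success(100, 38)) gives values close to 1/e.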

Now the probability of success, P(S_{n,k}), has to be optimized by determining what happens as the number of available candidates, n, gets infinitely large.  This requires setting the derivative of the bottom part of the piecewise function equal to zero to find critical values, and then finding the critical value that is the global maximum for the function.  At first it might seem intimidating to take the derivative of that bottom chunk, but it becomes easier when you notice that, as n grows, the sum can be rewritten as a definite integral by substituting t for (k-1)/n and u for (x-1)/n:

P(S_{n,k}) → t * ∫ from t to 1 of (1/u) du

Integration rules give:

t * (ln(1) - ln(t)) = -t * ln(t)

The derivative of –t * ln(t) with respect to t is:  –ln(t) - 1.  Setting this equal to zero and solving for t gives:

-ln(t)-1 = 0
-ln(t) = 1
ln(t) = -1
t = 1/e

You could plug this value into the original equation, but since the second derivative of -t * ln(t) is -1/t, which is negative for every t > 0, the critical value must be a maximum.  This means that when you have a large number of candidates, your odds of picking the best one are highest when you let 1/e (or about 36.8%) of the candidates go by without picking any, and then pick the next candidate who is better than all of the others seen so far.  When you do this, your odds of picking the best one are also 1/e, or about 36.8%, which may seem low, but it is the highest success rate that any strategy can achieve.

In the real world, you probably never know the true number of candidates, n.  But you can mitigate your lack of such knowledge by setting restrictions.  For example, if you set the number of people you’re willing to date throughout your life to 200, then you can follow the 36.8% heuristic to find the best mate out of the 200.  It is interesting to think that letting 36.8% of your options go by is like establishing a baseline.  By that point, you will have a good idea of what is out there, and will then be in a position to make a fair judgment about what option is worth choosing.

Note that there are variations of the secretary problem that change the outcome.  For example, you could have group interviews, or the candidate you choose could reject your offer.  You can read more about those scenarios here.

Exploring Optimal Stopping Theory Computationally

I wrote a Python script to test optimal stopping theory.  You can fork my GitHub repo for it.  The script generates a list of 1,000 random numbers between 1 and 100, stores the first 36.8% of them as a baseline, just as optimal stopping theory recommends, and then picks the next number that is higher than everything before it.  Because the seed is set for reproducibility, you’ll get the same set of random numbers, and therefore the same result, no matter how many times you run the script.  In my test, the best value, 100, was encountered within the first 36.8% of the observations, so no later option could ever beat the baseline.  Even though there could have been other 100s after the first one, no number could exceed it.  You can change the seed to any other number and try again; the seed is only there for repeatability.  For a seed of 10, the random numbers are shown below:

As you can see, if the numbers are ranked in order of 1 to 1,000 (the rank order runs from the bottom to the top of the y-axis), you run across very high numbers rather quickly.  So the result makes sense.  After all, if optimal stopping theory holds, you would expect to end up with the single best number only about 36.8% of the time.
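For anyone who wants to estimate the success rate itself rather than inspect a single run, here is a minimal sketch (my own code, not the script from the repo; it uses distinct values to avoid ties) that repeats the 1/e strategy over many shuffled lists:

```python
import random

def optimal_stop(values, fraction=0.368):
    """Skip the first `fraction` of values, then take the first value
    that beats everything seen so far (or the last value if none does)."""
    cutoff = int(len(values) * fraction)
    baseline = max(values[:cutoff])
    for v in values[cutoff:]:
        if v > baseline:
            return v
    return values[-1]

random.seed(10)  # any seed works; it's only for repeatability
trials, wins = 2000, 0
for _ in range(trials):
    values = random.sample(range(1_000_000), 1000)  # 1,000 distinct values
    if optimal_stop(values) == max(values):
        wins += 1
print(f"picked the best option in {wins / trials:.1%} of trials")
```

Run over enough trials, the measured success rate settles near the theoretical 36.8%.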

In this blog, I explained the theory of optimal stopping.  I proved why you maximize your odds of picking the best option out of a set of n total options if you let 36.8% of them go by without choosing, and then pick the next option that is better than all of the others you’ve seen until then.  Finally, I simulated a version of the secretary problem with Python, and showed that most of the time you probably won’t pick the best option.  So there is a little math that you can use to go out into the world and maximize your odds of making the best choices in life.

Saturday, November 15, 2014

How to Win at Rock-Paper-Scissors

What if I said that I could teach you how to win at rock-paper-scissors every single time?  No longer will you have to sit in the back seat, eat the smaller piece of cake, or be player two in a video game.  Does that sound too good to be true?  Maybe…but what if I said I could teach you how to win more times than your opponent?  That’s all you really need, and it seems more believable, right?  Well that’s exactly what I’m going to do!

When I was a young boy, I discovered that I could usually win at rock-paper-scissors if I played the hand that my opponent had just played and lost with.  The only requirement was that I win one round, and then I could go on winning indefinitely.  For some reason, most players who lost would choose the hand that would have beaten the hand that won the last round.  If they did not, they chose the third option, but they rarely ever replayed the hand they had just lost with.  They did not even realize they were doing it.

In psychology, this is called loss aversion: losses loom larger than gains, with studies suggesting a loss hurts roughly twice as much as an equivalent win feels good.  Players who had just lost with paper, for example, would choose anything but paper in the next round.  So grabbing a win and then playing the hand that your opponent just lost with is one strategy.  But how does that play out in terms of your odds of winning?

At first, your odds of winning are 1 in 3.  If both players picked hands at random, your odds would never change: in the long run, each of you would win approximately 33% of the time.  That is the Nash equilibrium for this game.  But people don’t behave randomly.  If you assume that my previous observation is correct, and that losers will pick the hand that would have beaten the hand that won the last round, then your odds of winning go up to 1 in 2: either they pick the hand that would have won (which your replayed hand beats), or they pick the remaining option (which beats yours).  They’ll rarely repeat the hand they just lost with.

If you lose, it’s better to play the part of the loser.  In other words, pick the hand that would have won the last round.  Winners typically stick with their winning choice, much like gamblers.  In terms of psychology, winning serves as positive reinforcement for choosing that hand.  So if you pick the hand that would have won, and assume that winners will not pick the hand that just lost due to loss aversion, then your odds of winning the next round increase to 1 in 2.

In summary, if your opponent is loss averse, then your best strategies are as follows:

1.    If you win, pick the hand that your opponent lost with for the next round.
2.    If you lose, pick the hand that would have beaten your opponent for the next round.
3.    If you tie, just pick randomly, because your odds of winning are still 1 in 3.
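Those three rules are easy to put to the test in code.  Below is my own quick simulation (the opponent model is an assumption taken straight from the observations above: losers never repeat their losing hand, winners never play the hand that just lost, and ties are random):

```python
import random

HANDS = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
BEATEN_BY = {loser: winner for winner, loser in BEATS.items()}      # key is beaten by value

def play(me, opp):
    """Return +1 if `me` wins, -1 if `opp` wins, 0 on a tie."""
    if me == opp:
        return 0
    return 1 if BEATS[me] == opp else -1

random.seed(42)
score = {"me": 0, "opp": 0, "tie": 0}
me, opp = random.choice(HANDS), random.choice(HANDS)

for _ in range(10_000):
    result = play(me, opp)
    if result == 1:
        score["me"] += 1
        next_me = opp                                  # rule 1: replay their losing hand
        # loss-averse loser: never repeats; picks the hand that would have
        # beaten my winner, or the remaining option, 50/50
        next_opp = random.choice([BEATEN_BY[me], me])
    elif result == -1:
        score["opp"] += 1
        next_me = BEATEN_BY[opp]                       # rule 2: the hand that beats their winner
        # winner sticks with the winning hand, or at least avoids my losing hand
        next_opp = random.choice([opp, BEATEN_BY[opp]])
    else:
        score["tie"] += 1
        next_me = random.choice(HANDS)                 # rule 3: ties are a coin flip
        next_opp = random.choice(HANDS)
    me, opp = next_me, next_opp

print(score)
```

Against this opponent model, the strategy ends up winning noticeably more rounds than the opponent, comfortably better than the 1-in-3 baseline.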

BBC featured a story in May about rock-paper-scissors, and cited a Chinese study that corroborates what I’ve just written.  The BBC story states that winners stay with their winning hand about 33% of the time, and losers shift to the next hand in the rock-paper-scissors sequence (so they would go from rock to paper to scissors).  But contrary to the BBC article, it’s not that losers are picking the next hand in the sequence, it’s that they’re picking the hand that would have won the last round.  It just so happens that if you lose with rock, then the hand that would have won is scissors, and scissors is also the next hand in the sequence.  The same is true for losing with scissors and paper.

Of course, all of the strategies I’ve listed above are speculation.  They rely on the assumption that your opponent is unaware of these same ideas, and is just playing the hand that they feel could win.  If your opponent is aware of these ideas, it could change the game.  Or worse yet, your opponent could know about game theory, and maximize his/her odds of winning by picking hands at random, every single time.  If that is the case, then your best bet is to choose hands at random as well.  But in any other scenario, I’ve shown how you can improve your odds of beating your opponent.  Now you can venture out into the world and win at rock-paper-scissors, and by doing so, you can also win at life.

Saturday, November 8, 2014

The Titanic Would Not Have Survived a Head on Collision with the Iceberg

One thing I’ve frequently heard about the Titanic that drives me crazy is the claim that if it had hit the iceberg head on, instead of at an angle, it would not have sunk.  That’s just absurd, and here is the math to prove it…

According to Google, the Titanic weighed 52,310 US tons.  The first step is to express that weight as a mass in kilograms:

Converting US tons to kilograms:  52,310 tons * 907.18474 kilograms per ton = 47,454,833.7494 kg

The Titanic was traveling at about 22 knots when it struck the iceberg, or 11.3178 m/s.  For simplicity, assume that it came to a dead stop upon impact, and that the stop took 1 second.  That makes the magnitude of the deceleration numerically equal to the velocity: 11.3178 m/s².  Since force = mass * acceleration, the force exerted by the Titanic if it rammed the iceberg would have been:

Force = 47,454,833.7494 kg * 11.3178 m/s² = 537,084,317.41 N

According to Newton’s third law, every force has an equal and opposite reaction, so the impact would have exerted just as much force back on the Titanic.  The Titanic’s hull was made of carbon steel, which has a shear strength of 74,000 psi.  To compare the impact force against that strength, assume (very generously) that the force was spread evenly over a full square meter of the bow.  The resulting stress would be:

537,084,317.41 N / 1 square meter = 537,084,317.41 Newtons per square meter

Since 1 Newton per square meter = 0.000145 psi, the stress in psi would have been:

537,084,317.41 * 0.000145 = 77,877 psi

And the real contact area would have been far smaller than a square meter, driving the stress far higher still.

That would have been more than enough force to break the carbon steel hull of the Titanic.  In fact, it would have been enough to shred it like paper.  The impact would have caused the steel to crumple and split into pieces, sending ripples down the entire length of the ship.  Everything inside would have been violently thrown forward like in a car crash without seatbelts.  The ship would have busted apart at the seams and sunk almost immediately.  Nobody would have survived.

For anyone who still harbors any doubt, consider the following…

Would the Titanic have broken the iceberg in half?

If the iceberg was anywhere from 200-400 feet long, as some sources suggest, it could have been at least as thick.  Trying to cut it in half with a ship would be like trying to cut a brick of ice apart with a butter knife.  It’s just not going to happen.  Furthermore, only a small fraction of an iceberg sits above the surface of the ocean.  If the iceberg that the Titanic hit were 100 feet high, as sources suggest, it could have extended many hundreds of feet below the surface.  Again, there is no way a ship so much smaller than the berg would have stood a chance at splitting it apart.

Would the Titanic have pushed the iceberg out of the way?  

Momentum must be conserved in a collision.  Assume, as before, that the ship came to a dead stop and transferred all of its momentum to the iceberg, and take a rough iceberg mass of 150,000,000 kg.  The new velocity of the iceberg would have been:

m_t * v_t1 + m_i * v_i1 = m_t * v_t2 + m_i * v_i2

( 47,454,833.7494 kg * 11.3178 m/s ) + ( 150,000,000 kg * 0 m/s ) = ( 47,454,833.7494 kg * 0 m/s ) + ( 150,000,000 kg * x m/s )

537,084,317.41 kg·m/s = 150,000,000 kg * x m/s

x = 3.5806 m/s

So after the Titanic hit, the iceberg could theoretically have moved off at 3.5806 m/s, or about 8 mph, slowing quickly due to drag from the ocean water.  However, since the majority of an iceberg lies beneath the water, the Titanic would have struck only its top.  Like a bobblehead doll, the berg more than likely would have wobbled rather than moved, and for an object of that size, any movement would have been barely noticeable.  So there is no way the Titanic could have pushed it out of the way.
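All of the arithmetic above fits in a few lines of Python (my own sketch; the 150,000,000 kg iceberg mass is the same rough figure assumed in the momentum equation):

```python
TON_TO_KG = 907.18474            # kilograms per US ton
mass_ship = 52_310 * TON_TO_KG   # ≈ 47,454,833.75 kg
v_ship = 11.3178                 # 22 knots in m/s

# F = m * a, assuming a dead stop over 1 second (a = 11.3178 m/s^2)
force = mass_ship * v_ship       # ≈ 5.37e8 N

# Momentum conservation: all of the ship's momentum is handed to the iceberg
mass_berg = 150_000_000          # kg, rough assumption
v_berg = mass_ship * v_ship / mass_berg

print(f"impact force ≈ {force:,.0f} N")
print(f"iceberg velocity ≈ {v_berg:.4f} m/s ({v_berg * 2.23694:.1f} mph)")
```

This reproduces the 537,084,317 N and 3.5806 m/s (about 8 mph) figures from the text.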

Saturday, September 13, 2014

A BI Analyst’s Review of OBIEE (Oracle Business Intelligence Suite)

Oracle Business Intelligence Suite Enterprise Edition (OBIEE) is Oracle’s premier business intelligence platform.  I wrote a review of it on a software review site and decided to share it in this blog as well.  When you’re done reading, you can head over there to read or write reviews for other BI platforms.

A Comprehensive Business Intelligence Suite
Overall Rating:  4/5

I am a business intelligence analyst who has used Oracle Business Intelligence Enterprise Edition (OBIEE) and Oracle Enterprise Performance Management (Hyperion EPM) for BI reporting. As a power user, I cannot speak to the cost or ease of system integration of Oracle Business Intelligence, but I can attest that its features are leaps and bounds beyond its predecessor, and it is easy to use once you learn how.

OBIEE is built upon a repository with 3 layers of metadata. The physical layer (bottom layer) holds the data, primary and foreign keys, and connections. The business model mapping layer (middle layer) is where the business logic, aggregation and structuring occurs. The presentation layer (top layer) is where data is arranged and presented to the end users. The presentation layer is also where security roles are assigned. Since most users will only see data from this presentation layer, it helps standardize organizational reporting so that bad queries do not result in different teams of people ending up with different data. 

Layering allows OBIEE to connect to multiple data sources and combine them all in the presentation layer. For example, you can build a report with connections to a relational table and an Essbase cube. You read that right: you could build a report using a seamlessly integrated combination of SQL and MDX, without knowing how to write either one!

OBIEE allows alerts and agents to be set up for scheduled processes. Agents are like advanced alerts that can trigger other processes depending on whether or not certain conditions are met. The automation that this feature allows is tremendous.

Mobile Application Developer is another product in the suite that allows users to build HTML 5 applications that can be viewed on any mobile device. The interface reminds me of Microsoft PowerPoint, and you simply drag and drop charts, tables, filters, etc. onto the screen. Once the app is built, you can scan the QR code to view it on your phone. Since the apps are built with HTML, they can also be viewed in a web browser. 

Overall, I would rate Oracle Business Intelligence as an excellent product.

Sunday, July 13, 2014

How Machine Learning is Changing Artificial Intelligence

I recently saw news headlines about the Terminator reboot and Stephen Hawking warning about uncontrollable artificial intelligence, and it got me thinking: how close are we to having computers that can think for themselves?

I read an article on the BBC in which the author, Tom Chatfield, argued that AI has lost its way.  He compared AI as it was conceived in 1956 to AI today: in 1956, the future of AI was thought to lie in creating thinking computers, whereas today it lies in having computers analyze large datasets and answer questions based on the probability of an answer being correct.  Computers that have beaten humans at games, like Watson at Jeopardy! and Deep Blue at chess, have done so by analyzing data, not by knowing anything about the game they are playing.  The process of analyzing data to learn an answer is called machine learning.  So is Tom correct?  Do computers that “think” using machine learning have false artificial intelligence?

Today’s computers still cannot pass the Turing Test, a test used to measure AI.  In the test, a human judge carries on a conversation with an unseen partner, without being told whether that partner is another human or a computer.  If the judge concludes that the conversational partner is human when it is actually a computer, then the computer has passed the Turing Test, and passing is generally taken as evidence of intelligence.  Today’s computers are not yet intelligent by the Turing standard, but they are getting closer using machine learning.  So how could machine learning be the wrong direction for the field of AI?

Tom Chatfield references Douglas Hofstadter’s comments about AI, in which Hofstadter states that intelligence is “inappropriate for describing insights drawn by brute computing power from massive data sets”.  Hofstadter argues that fancy programming has nothing to do with how intelligence arises in the human brain.  In my opinion, that is not accurate at all.  In machine learning, a computer draws a conclusion based on comparing new information to historical data.  How is that any different from what humans do?  When we face a new situation, we compare it to what we already know too.  The difference between computers and humans is that when there is no historical data to compare to, humans can still make a good guess, whereas computers cannot.

I think the reason for that difference is that humans can easily break new information down into parts, whereas computers struggle to.  For example, if you had no knowledge of soccer, but you knew everything about every other sport, and you were put into a room with nothing but a soccer ball and a net, how would you figure out what to do?  You could start by comparing the ball to the balls in other sports.  You have nothing to strike the ball with, so you would probably have to handle it directly.  It’s about the size of a basketball, so it could be tossed.  Then you look at the net and notice it is much larger than a basketball net.  Since basketball nets are small, they require precision.  It would be too easy to toss a ball into the huge net in front of you.  Furthermore, the net touches the floor.  Maybe the larger net requires less precision.  Maybe the fact that it touches the floor means the ball can be rolled into it.  Maybe you could kick the ball into the net…  Do you see how that thought process works?  You break the scenario down into pieces and compare the pieces to your existing knowledge, rather than trying to match the scenario as a whole to an existing one.  Doing that programmatically on a computer is difficult, but it’s almost effortless for human brains.

Imagine a toddler learning to walk, talk, and do whatever.  They have no historical data or experience to compare new information to, so they learn everything by trial and error, and by observing other people.  Anything they learn in the process is stored in their “knowledge base” until proven wrong.  Why couldn’t computers do the same thing?  Perhaps machine learning is not the wrong approach to AI, but an incomplete or evolving one.  To my knowledge, there are no machine learning techniques that entail starting with no historical data and learning everything from scratch.  What would a machine learning technique like that look like?  Consider the following thought experiment…

Suppose you want to create a computer that is capable of learning English, so that it could read a book and know whether or not the book was written in English.  How would you do it?

You could start by defining the English alphabet.  The computer could then loop through each letter of every word in the book, and for each letter, loop through the known alphabet.  If all the letters in that book are contained within the alphabet, then the book is written in English.

This program would work fine if all the letters of the English alphabet were already known.  But suppose you only knew of letters A-W, and had no idea that letters X, Y, and Z existed.  Your program could be modified so that if it found letters that weren’t in the existing alphabet, it would add them to the alphabet, and assume the new alphabet is correct.

Now suppose that the Spanish word niña is in the book.  If your program added ñ to the English alphabet, that would be incorrect.  You could modify your program to evaluate the alphabet at the end of the book.  If any letters are used fewer than x number of times, they would be thrown out of the English alphabet.  You could improve performance further by reading multiple books.  With each new book, the likelihood of your alphabet being the correct English alphabet would increase.  Of course, you would have to move the evaluation part of your program to the very end, otherwise improvements made by reading each new book would be lost.  This is the same way humans learn – assume new information is correct until proven wrong.  To be fair, you could do the reverse – assume new information is wrong until proven correct.  If you did it that way, you would not add letters to the alphabet until you saw them x number of times.
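That learning loop is simple enough to sketch in a few lines of Python (my own toy code, with made-up “books”): every new letter is assumed correct when first seen, and the evaluation happens once at the very end, discarding letters seen fewer than a threshold number of times.

```python
from collections import Counter

def learn_alphabet(books, min_count=2):
    """Assume every new letter is correct, then throw out letters
    seen fewer than min_count times across all books."""
    counts = Counter()
    for book in books:
        for ch in book.lower():
            if ch.isalpha():
                counts[ch] += 1          # unseen letters are added automatically
    # evaluate only once, at the very end, so progress isn't lost
    return {ch for ch, n in counts.items() if n >= min_count}

books = [
    "the quick brown fox jumps over the lazy dog",   # pangram: all 26 letters
    "pack my box with five dozen liquor jugs",       # another pangram
    "la niña",                                       # stray Spanish word
]
alphabet = learn_alphabet(books)
print(sorted(alphabet))  # ñ appears only once and is discarded; a-z survive
```

Because ñ shows up only once, it falls below the threshold and is thrown out, while every English letter appears at least twice across the two pangrams and is kept.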

Computers that could learn without having historical data to compare to, or computers that could easily break new information into pieces that could be compared to historical data would be incredibly powerful.  In some cases, they could have perfect data.  For example, a computer making stock trades could have the entire history of the stock market to compare to; something no human brain could do.  If it did have that history, and if it could accurately assimilate new events, then could you say that the computer would be the most experienced trader on Wall Street?  I guess we’ll have to see what the future brings.

Saturday, June 21, 2014

Learn Basic SQL Part 2 – Querying Databases

In part 1 I talked about creating databases and tables with SQLite, but I didn’t explain how to add data to tables.  In this part, I will do that and go over basic querying.  Queries are the reason SQL exists.  Most jobs that require SQL skills will have you querying data, not building databases or tables.  Here’s a brief recap of part 1:

From the command line, open the SQLite folder, start SQLite3, create a database called test.db, create a table called employees with a few fields, and make sure one field is the primary key.

C:\>cd sqlite

sqlite> .open test.db
sqlite> create table employees
   ...> (
   ...> emplid integer primary key,
   ...> name varchar(20),
   ...> age integer,
   ...> sex varchar(1),
   ...> jobcode integer
   ...> );

Make sure you add the job code field to the table.  It will be used later on.

Add Data to a Table

To add data, enter the following:

sqlite> insert into employees
   ...> (emplid, name, age, sex, jobcode)
   ...> values
   ...> (1, "John Smith", 30, "M", 1001),
   ...> (2, "Alice Wonderland", 20, "F", 1002),
   ...> (3, "Billy Bob", 40, "M", 1002),
   ...> (4, "Jane Doe", 30, "F", 1003)
   ...> ;

From that you can gather that the command to enter data into tables is “insert into [table name]” followed by the list of fields that you are about to enter, then the values for those fields.  Separate each new entry with a comma.  In database lingo, new entries of data are called rows or records.

You can check to see that your data has been entered into the employees table by typing:

sqlite> select * from employees;

Notice your results show rows of data for the values you just inserted into the table.  Also notice that what you just typed is called a query.  That’s right; you just wrote your first query.  How does it feel? 
You can clean up the results by turning column headers on…

sqlite> .headers on

…and by setting the mode to column:

sqlite> .mode column

Now if you run the same query, your results will look like a standard table.


Look back at that query:  sqlite> select * from employees;

The asterisk is the symbol for “everything”.  So when you select * you are selecting everything in the table.  The syntax for all SQL queries is “select [field names] from [table name]”.  Now try selecting only the employee ID’s and names:

sqlite> select emplid, name from employees;

Do you see how easy it is to query specific fields?  Now you can get fancier by adding a “where” clause.  The “where” is like a filter for your query.  It limits the results to what you specify.  For example, to query all data from the employees table for females only, type:

sqlite> select * from employees where sex = "F";

You can combine multiple filters with the AND/OR keywords.  For example, to query all data from the employees table for females older than 25, type:

sqlite> select * from employees where sex = "F" and age > 25;

By using parentheses, you can craft even fancier queries.  For example, to query all data from the employees table for females older than 25 or males whose name begins with a “J”, type:

sqlite> select * from employees
   ...> where
   ...> (sex = "F" and age > 25)
   ...> or
   ...> (sex = "M" and name like "J%");

Notice the introduction of the wildcard “%” in the last query.  A wildcard is like a variable in math: it can stand for any sequence of characters, including none at all.  So when you want names that begin with a “J”, you can say that the name is “like” a J followed by a wildcard.  You could put the % in front of the J to search for names that end with a J, or on both sides of the J to search for names that contain a J somewhere.

The last thing I will explain about basic queries is how to sort your results.  To sort your results from the previous query by name in ascending/alphabetical order, type:

sqlite> select * from employees
   ...> where
   ...> (sex = "F" and age > 25)
   ...> or
   ...> (sex = "M" and name like "J%")
   ...> order by name asc;

You could sort in reverse/descending order by changing the “asc” to “desc”.  Notice that the “order by” command should be put at the very end of the query.

Updating Data

Suppose you wanted to update or change a particular field for a particular employee.  To do this you would type “update [table name]” and then set the field that you want to change to the value that you want.  As an example, this is how you would change John Smith’s age to 35:

sqlite> update employees
   ...> set age = 35
   ...> where name = "John Smith";

Typing the “where” part of the command is crucial.  If you did not specify where to set age equal to 35, every single row in the employees table would change to age = 35.  Remember from part 1 of these SQL tutorial blogs where I explained how to add columns to a table using the alter table command?  Now you can add a column and update the records for the new column.


In part 1 of these SQL tutorial blogs I mentioned how primary and foreign keys are useful for joining tables together.  If you have an employees table, for example, you might find it easier to organize your database by keeping job specific data in a separate table, and joining the job data table to the employees table by job code.  Here is a reminder of how that join could work:

There are different types of joins based on what data you want to pull.  
  • If you want only the rows from both tables where the job codes are equal, that’s called an inner join (or equi-join, since the match condition is equality).   
  • If you want all the rows from the employees table, plus any rows from the job table where the job code matches, that’s a left outer join.  When you do a left join, some rows in the table on the right could be left out of your results.  
  • The opposite of a left join is a right join, which will pull everything from the job table, plus any rows from the employees table where the job code matches.  
  • Finally, there is a full outer join, which pulls everything from both tables, regardless of whether the job codes match or not.  When there are records from either table that have no matching counterpart in the other table, the values for those columns will be null.  Unfortunately, SQLite does not support full outer joins at this time.  
There are other types of joins, but these are the 4 most common.  Although I included a foreign key in the diagram above, we don’t have one in the employees table right now.  That’s ok though, because SQLite allows you to join tables without foreign keys.  But before we can join anything, we need a job table:

sqlite> create table job
   ...> (
   ...> jobcode integer primary key,
   ...> jobtitle varchar(20),
   ...> category varchar(20)
   ...> );

Now fill the table with data:

sqlite> insert into job
   ...> (jobcode, jobtitle, category)
   ...> values
   ...> (1001, "CEO", "Management"),
   ...> (1002, "Accountant", "Finance"),
   ...> (1003, "Recruiter", "Human Resources");
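You can verify the inserts worked by selecting everything from the job table.  The .headers and .mode dot commands are optional, but they make the output easier to read:

sqlite> .headers on
sqlite> .mode column
sqlite> select * from job;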

Now we’ll create a query that will pull everything from both tables where the job codes match.  The syntax for a query with an inner join is: “SELECT [table name 1].[field name], [table name 2].[field name] FROM [table name 1] INNER JOIN [table name 2] ON [table name 1].[foreign key] = [table name 2].[primary key];”

sqlite> select employees.*, job.jobtitle, job.category from employees
   ...> inner join job
   ...> on employees.jobcode = job.jobcode;

Notice that you can include the wildcard selector, *, in a join statement too.  Also notice that you never declared a foreign key anywhere, but by writing the on clause this way, job code is effectively being used as a foreign key in the employees table.
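For comparison, here is what the left outer join described earlier would look like.  Any employee whose job code has no match in the job table would still appear in the results, with null in the jobtitle column:

sqlite> select employees.name, job.jobtitle from employees
   ...> left join job
   ...> on employees.jobcode = job.jobcode;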

One final note is that if you want to filter the results in a joined query using a WHERE clause, it must go at the end of the query, after the “on tablename.field = tablename2.field” part of the code.
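For example, this query sketches how you could pull only the management employees, assuming the categories from the job table above:

sqlite> select employees.*, job.jobtitle from employees
   ...> inner join job
   ...> on employees.jobcode = job.jobcode
   ...> where job.category = "Management";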

I hope these tutorial blogs have helped give you an understanding of the basics of SQL.  If you want more practice you can try the SQL tutorials on w3schools, or you could try moving on to MySQL.

Tuesday, May 20, 2014

Learn Basic SQL Part 1 – Creating and Manipulating Tables

Over the last few months I’ve been teaching myself structured query language (SQL).  SQL is used to retrieve data from a database.  So in order to use it, you typically need a database and some kind of database management software that connects to the database and can interpret SQL.  Fortunately, there is a program that lets you create a database on your computer as a .db file and query it: SQLite.  SQLite is an excellent free tool to help you learn SQL, and I’ve found that the syntax is nearly the same as MySQL (there are different dialects of SQL, but they’re similar enough that knowing one helps you figure out the others).  There are significant limitations to using SQLite, but it’s a good tool for becoming familiar with the language.  Getting started with it was easy enough, but I had to look for help from disparate sources.  So I’m writing this blog because it’s what I would have liked to have had when I first started: everything you need to know to start learning SQL from scratch, all in one place.

Download and Install SQLite

Go here to download SQLite:  If you use Windows, like me, find where the page says “Precompiled Binaries for Windows” and download the first option:

After it downloads, create a new folder in your root C: directory called SQLite, or whatever.  Unzip the downloaded SQLite folder somewhere, and copy the sqlite3.exe from the unzipped folder into the folder you just created in the C directory.  If you get stuck, here is an excellent YouTube video showing the whole process.

Now you can run SQLite by opening the .exe file, or you can do it from the command prompt.  I prefer to use the command prompt, so the instructions from here down refer to using SQLite through that.

Create a Database

Open the command prompt by typing cmd in the search bar in the Windows start menu.  When it opens it will default to your user folder.  You’ll see something like this:


You want to get back to your root C: directory, so type “cd\” without the quotes.  Now that you’re in the C directory, open the SQLite folder:

C:\>cd sqlite\

Your .exe file is inside the SQLite folder, so start it up:


Everything you type from now on is executed within the SQLite3 program.  As you can see in the command prompt, all SQL statements must end with a “;” (except for dot commands, like the next line you will type).  Now create a database called “test.db”:

sqlite> .open test.db

Notice that the same command is used to open an existing database and create a new one.  If you try to open a database that doesn’t exist, one is created with that name.  Since there wasn’t yet a “test.db” file, your computer automatically created one and put it in the same folder that SQLite3.exe is in.
That’s it!  Now you have a database called test.db.
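If you want to double check which database file is open, use the .databases dot command, which lists each attached database and the path to its file:

sqlite> .databases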

Create a Table

Enter this code:

sqlite> create table employees
   ...> (
   ...> id integer,
   ...> name varchar(20),
   ...> age integer,
   ...> sex varchar(1)
   ...> );

Now you have a table called employees!  It has nothing in it though.  What you just did was create a table with field headers: id, name, age, and sex.  When you create fields in SQL, the data type (integer, varchar, etc) must be specified.  You can find a list of data types here.  Notice the syntax after naming the table is ( ) with the field names inside, each separated by a comma.  Also notice that your SQL statement could be typed all on 1 line or separated like I have it.  It doesn’t matter.  The statement doesn’t end until the semicolon is typed.
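For example, this one-line statement is just as valid as the multi-line version above (the departments table here is made up purely for practice):

sqlite> create table departments (dept_id integer, dept_name varchar(20));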

You can check that your database has a table in it with the .schema dot command:

sqlite> .schema

This will show all of the tables that exist in your database, and the SQL you wrote to create them.  Try creating a new table and verifying it exists with the .schema command.

Add or Delete Columns

Note: The terms “field” and “column” are used interchangeably.

To add a salary column to your employees table, type:

sqlite> alter table employees
   ...> add column
   ...> salary integer;

Now you have a column added to the end of your employees table called salary, and it is an integer.  Test to see that it exists by dumping the table.  The dump function will show all the code you have written for that table.  You should see your salary column at the end:

sqlite> .dump employees

Note:  You cannot add columns in between other columns in SQLite.  To do that you would have to create a new table, copy everything to the new table in the order you want it, and then rename that new table.  Click here to see how to do that, or follow the example below.

Suppose you no longer want the salary column included in your employees table.  This is where SQLite is limited.  You cannot delete columns like you can with MySQL.  Instead you have to create a new table, copy everything to it, delete the old table and rename the new one:

sqlite> create table employees2
   ...> ( id integer, name varchar(20), age integer, sex varchar(1) );
sqlite> insert into employees2
   ...> select id, name, age, sex from employees;
sqlite> drop table employees;
sqlite> create table employees
   ...> ( id integer, name varchar(20), age integer, sex varchar(1) );
sqlite> insert into employees
   ...> select id, name, age, sex from employees2;
sqlite> drop table employees2;

From that code you can see that “insert into…” and “select… from…” are the commands needed to copy data, and “drop” is the command to delete a table.  Notice that you could have skipped the insert into part of the previous code, because your current table has nothing in it.  I left that line in there so you will know how to remove columns from tables that do have data in them.  Dump your table to be sure you created it correctly.
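That procedure can also be shortened.  Instead of copying the data back into a recreated employees table, you can rename the new table so it takes the old table’s place:

sqlite> create table employees2
   ...> ( id integer, name varchar(20), age integer, sex varchar(1) );
sqlite> insert into employees2
   ...> select id, name, age, sex from employees;
sqlite> drop table employees;
sqlite> alter table employees2
   ...> rename to employees;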

Renaming Tables

Rename tables using the alter table command:

sqlite> alter table employees
   ...> rename to employees2;

If you call the schema function, you will now only see employees2.  Try renaming it back to employees.

Creating Primary and Foreign Keys

In relational databases, a primary key uniquely identifies each row of data.  That means it has to be unique to each row.  A good primary key would be an employee ID number, because no 2 people can have the same number, and an employee’s ID number never changes.  Primary keys are what make joining tables reliable, and there can only be 1 per table.  Since we did not create one in the employees table, it shouldn’t be joined to a table that lists employees’ phone numbers, for example (well, it could technically, but it is best practice to use keys).  While you could just add a column to your employees table for phone number, sometimes it is beneficial to store the data in another table.

Foreign keys are the things that allow tables to be joined.  A foreign key in one table matches the primary key in another table.  Suppose you had a job code number in your employees table that identifies an employee’s job.  You could store job-specific data in a separate table, such as skill requirements, salary ranges, etc.  You could then join the employees table to the job table by specifying job code number as a foreign key in the employees table, and as the primary key in the job table.  Here is how that would look:

When creating tables in SQLite, it is easiest to specify the primary key at the beginning.  Otherwise you have to do the whole create new, copy data, drop old, rename new table procedure.  So go ahead and drop the employees table that exists now and start over.  Specify employee ID as the primary key and job code as a foreign key:

sqlite> create table employees (
   ...> employee_id integer primary key,
   ...> name varchar(20),
   ...> age integer,
   ...> sex varchar(1),
   ...> jobcode integer,
   ...> foreign key (jobcode) references job(jobcode)
   ...> );

Notice that the primary key is simply appended to the field after the data type.  You could also specify it at the end, like the foreign key is.  The foreign key is given a reference to the jobcode field in the job table (which hasn’t been created yet).  Read the SQLite documentation about foreign keys to learn more.
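One thing to be aware of: SQLite does not actually enforce foreign key constraints unless you turn them on, and you have to do so separately in each session:

sqlite> pragma foreign_keys = on;

With that pragma on, SQLite will reject an insert into the employees table whose job code doesn’t exist in the job table.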

So there you go.  Now you know the basics behind creating tables with SQL.  Stay tuned for part 2, which will cover adding data to tables and running basic queries.