Monday, January 19, 2015

Learn Basic SQL Part 4 – Case Statements and Unions

The first 3 parts of these tutorials have covered just about all of the basics of SQL.  There is a wide variety of functions out there, but the ones I have shown are arguably the most common and will get you through most situations.  In part 4, I will explain case statements and unions, two final topics that I think will be helpful for SQL beginners.  At the end, I’ll list resources for further exploration.

Case Statements

Case statements test an input against one or more conditions, and when a condition is met, a specified value is returned.  For example, you could use a case statement to return “Rich” when a person’s salary is above a given threshold, or “Poor” if it is not.  Using the same database as part 3, that would be:

SELECT a.name, b.salary,
    CASE
    WHEN b.salary > 100000 THEN "Rich"
    ELSE "Poor"
    END
FROM employees a
JOIN jobs b ON b.job_code = a.job_code;

Notice that case statements open with CASE and end with END.  If there were multiple WHEN clauses, they would not be separated by commas.  Also note that SQLite accepts the double quotes used here around strings, although single quotes are the SQL standard for text values. 
Case statements can also be used to get tallies of how often an event occurs.  You can do this by returning 1 when a condition is met and 0 otherwise, summing the whole case statement, and then returning the desired rows using a HAVING clause.  For example, here is the SQL to get a count of production engineers who report to Candide Optimist: 

SELECT b.name, SUM(CASE
    WHEN (a.manager_id =
        (SELECT id FROM employees
        WHERE name = "Candide Optimist")
    AND a.job_code =
        (SELECT job_code FROM jobs
        WHERE title LIKE "%Production Engineer%"))
    THEN 1
    ELSE 0
    END) count_of_engineers
FROM employees a
JOIN employees b ON b.id = a.manager_id
GROUP BY b.name HAVING count_of_engineers > 0;

I guarantee there are easier ways to get a count of production engineers working for a given manager, but I wanted to show how to do it using a summed case statement. 
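For comparison, here is one of those easier ways: a plain filtered COUNT with the same two subqueries in the WHERE clause.  This is a sketch using Python’s built-in sqlite3 module against a stripped-down copy of the tutorial tables (only the columns and rows the query needs):

```python
import sqlite3

# In-memory database with a stripped-down version of the tutorial tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT,
                        job_code INTEGER, manager_id INTEGER);
CREATE TABLE jobs (job_code INTEGER PRIMARY KEY, title TEXT);
INSERT INTO employees VALUES
    (5, 'Blanche DuBois', 1004, 6),
    (6, 'Candide Optimist', 1003, 1),
    (7, 'Quentin Compson', 1004, 6),
    (11, 'Bob Cratchit', 1004, 6),
    (13, 'Rosie Cotton', 1004, 6);
INSERT INTO jobs VALUES
    (1003, 'VP of Production'),
    (1004, 'Production Engineer');
""")

# A plain filtered COUNT does the same job as the summed case statement.
count = conn.execute("""
    SELECT COUNT(id) FROM employees
    WHERE manager_id =
        (SELECT id FROM employees WHERE name = 'Candide Optimist')
    AND job_code =
        (SELECT job_code FROM jobs WHERE title LIKE '%Production Engineer%')
""").fetchone()[0]
print(count)  # 4
```

Both approaches find the same 4 production engineers; the summed case statement just generalizes better when you want several tallies in one query.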


Unions

Unions are used to stack records in a database.  Since they combine rows of data, they can only be used when the queries being combined return the same number of columns.  A good example of when this would be useful can be found on w3schools’ tutorial: combining a list of customer addresses with a list of supplier addresses.  Maybe you want to know all of the cities where your company has operations.  A union query would allow you to stack the datasets and report off of the results.  Since there is no need for one in the database I created for these tutorials, and the w3schools tutorial is easy enough to understand, I won’t include an example here.
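If you’d like to see the general shape of one anyway, here is a minimal sketch using Python’s sqlite3 module; the customers and suppliers tables are my invention, not part of the tutorial database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (name TEXT, city TEXT);
CREATE TABLE suppliers (name TEXT, city TEXT);
INSERT INTO customers VALUES ('Acme', 'Seattle'), ('Globex', 'Portland');
INSERT INTO suppliers VALUES ('Initech', 'Seattle'), ('Umbrella', 'Boise');
""")

# UNION stacks the two result sets and removes duplicates (Seattle appears
# once); UNION ALL would keep every row from both queries.
rows = conn.execute("""
    SELECT city FROM customers
    UNION
    SELECT city FROM suppliers
    ORDER BY city
""").fetchall()
cities = [city for (city,) in rows]
print(cities)  # ['Boise', 'Portland', 'Seattle']
```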

Query Examples

1.  Which city has the most employees, and what is the average salary in that city?

SELECT city, MAX(average_salary)
FROM
    (SELECT c.city, AVG(b.salary) AS average_salary
    FROM employees a
    JOIN jobs b ON b.job_code = a.job_code
    JOIN location c ON c.location = a.location
    GROUP BY c.city);

For this task, it is easiest to get the average salary by city first, then select the city with the highest average salary from that.  In order to join salary information from the jobs table with the city names in the location table, the employees table had to be used to join both of them.  It is acceptable to use a third table to join on, even if no fields are being pulled from that third table. 

2.  Is there a difference between the salaries of men and women who have the same job code?

SELECT a.job_code, a.sex, AVG(b.salary)
FROM employees a
JOIN jobs b ON b.job_code = a.job_code
WHERE a.job_code IN
    (SELECT job_code FROM employees
     GROUP BY job_code HAVING COUNT(DISTINCT sex) > 1)
GROUP BY a.job_code, a.sex;

For this task, you only want to see job codes that contain both men and women.  The subquery ensures that you do, by pulling back job codes where the count of distinct instances of sex is greater than 1 (because if a job code had men and women in it, then there would be 2 distinct results).  Looking through the results, you can see that in every job code where there were men and women, the salaries are the same.

3.  Which city has the highest percentage of employees who are under 30? 

SELECT city, MAX(percent_under_30)
FROM
    (SELECT c.city,
    SUM(CASE
        WHEN b.age < 30
        THEN 1.0
        ELSE 0.0
        END)
    / COUNT(b.id) * 100 AS percent_under_30
    FROM employees a
    LEFT JOIN employees b ON b.id = a.id
    JOIN location c ON c.location = a.location
    GROUP BY c.city);

The key here, as with most tasks that require subqueries, is to start from the inside out.  The inside of this code simply pulls the percentage of employees under 30 for each city.  I found it easiest to start simple; build a query that pulls back a headcount under 30, and a total headcount for each location.  Then figure out how to divide them.  Then put that into the subquery, and finally pull in the city name.  Note that since this query calculates a percentage as a real number, you must use decimal points in your case statement to force SQLite to calculate the answer as a real number.  Otherwise SQLite will convert the numbers to integers before dividing, and if you ran the middle part of the code, you would get zeroes for the percentages of people under 30 for 2 of the 3 cities. 
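You can see SQLite’s integer-versus-real division behavior in isolation with a quick sqlite3 session from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# With two integer operands, SQLite does integer division and truncates;
# making either operand a real number forces real division.
int_result = conn.execute("SELECT 1 / 3 * 100").fetchone()[0]
real_result = conn.execute("SELECT 1.0 / 3 * 100").fetchone()[0]
print(int_result)   # 0
print(real_result)  # 33.333...
```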

SQL Resources

SQL Fiddle is a free online IDE for writing SQL.  You build a database on the left, and write queries on the right.  You could have done all 4 of my tutorials using this instead of SQLite. 

More tutorials about SQL can be found at w3schools and SQLZOO.  I found SQLZOO’s exercises to be much better, but they don’t provide a lot of instructions. 

Tuesday, January 6, 2015

Learn Basic SQL Part 3 – Subqueries and Aggregate Data

In part 1 of these basic SQL blogs, I showed how to create databases and tables, and how to manipulate those tables.  In part 2, I showed how to update or add data to tables, and how to query those tables.  At the end of part 2, I described how tables can be joined.  Now in part 3, I’m going to explain how and why you can build subqueries, as well as how data can be aggregated or summarized.  The first block of code below will build a database with a few tables of data.  It will serve as a review of parts 1 and 2, and add complexity to the database that will be needed for subquerying and aggregating. 

C:\>cd sqlite

C:\sqlite>sqlite3

In previous blogs, I have included the “sqlite>” and “…>” that start every line within the program, because I wanted to comfort beginners by showing them exactly what they would see on their screens.  But from now on I will omit it to keep the code more concise.  (Note that in order to paste the code below into your command line, you have to right click and choose paste.)

.open myCompany.db

CREATE TABLE employees (
    id integer primary key,
    name varchar(25),
    age integer,
    sex varchar(1),
    job_code integer,
    location varchar(5),
    manager_id integer
);

CREATE TABLE jobs (
    job_code integer primary key,
    title varchar(30),
    category varchar(20),
    pay_grade integer,
    salary real
);

CREATE TABLE location (
    location varchar(5) primary key,
    city varchar(25),
    state varchar(25)
);

INSERT INTO employees
    (id, name, age, sex, job_code, location, manager_id)
VALUES
    (1, "John Smith", 40, "M", 1001, "A100", 1),
    (2, "Jane Smith", 40, "F", 1002, "A100", 1),
    (3, "Alice Wonderland", 20, "F", 1005, "A100", 2),
    (4, "Nick Carroway", 35, "M", 1007, "B100", 6),
    (5, "Blanche DuBois", 30, "F", 1004, "B100", 6),
    (6, "Candide Optimist", 55, "M", 1003, "B100", 1),
    (7, "Quentin Compson", 25, "M", 1004, "C100", 6),
    (8, "April Wheeler", 25, "F", 1005, "A100", 2),
    (9, "Winston Smith", 45, "M", 1005, "A100", 2),
    (10, "Don Juan", 35, "M", 1005, "B100", 2),
    (11, "Bob Cratchit", 30, "M", 1004, "A100", 6),
    (12, "Peregrin Took", 20, "M", 1008, "B100", 4),
    (13, "Rosie Cotton", 20, "F", 1004, "A100", 6),
    (14, "Dolores Haze", 60, "F", 1006, "A100", 1),
    (15, "La Esmerelda", 30, "F", 1008, "B100", 4);

INSERT INTO jobs
    (job_code, title, category, pay_grade, salary)
VALUES
    (1001, "CEO", "Executive", 10, 200000),
    (1002, "VP of Sales", "Executive", 9, 190000),
    (1003, "VP of Production", "Executive", 9, 150000),
    (1004, "Production Engineer", "Associate", 6, 80000),
    (1005, "Sales Associate", "Associate", 6, 80000),
    (1006, "HR Manager", "Management", 7, 100000),
    (1007, "Production Specialist", "Management", 7, 110000),
    (1008, "Quality Test Engineer", "Associate", 5, 70000);

INSERT INTO location
    (location, city, state)
VALUES
    ("A100", "Seattle", "Washington"),
    ("B100", "Portland", "Oregon"),
    ("C100", "San Francisco", "California");

As you can see, there is nothing new so far, but the additional data and tables will allow you to do more with your queries. 
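If you’d rather follow along from Python instead of the SQLite shell, the same database can be built with Python’s built-in sqlite3 module.  Here is a sketch; it uses an in-memory database, but you can pass "myCompany.db" instead to write to disk:

```python
import sqlite3

# Build the same schema and data as above with Python's sqlite3 module.
conn = sqlite3.connect(":memory:")  # or "myCompany.db" to persist to disk
conn.executescript("""
CREATE TABLE employees (
    id integer primary key, name varchar(25), age integer, sex varchar(1),
    job_code integer, location varchar(5), manager_id integer);
CREATE TABLE jobs (
    job_code integer primary key, title varchar(30), category varchar(20),
    pay_grade integer, salary real);
CREATE TABLE location (
    location varchar(5) primary key, city varchar(25), state varchar(25));
""")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "John Smith", 40, "M", 1001, "A100", 1),
    (2, "Jane Smith", 40, "F", 1002, "A100", 1),
    (3, "Alice Wonderland", 20, "F", 1005, "A100", 2),
    (4, "Nick Carroway", 35, "M", 1007, "B100", 6),
    (5, "Blanche DuBois", 30, "F", 1004, "B100", 6),
    (6, "Candide Optimist", 55, "M", 1003, "B100", 1),
    (7, "Quentin Compson", 25, "M", 1004, "C100", 6),
    (8, "April Wheeler", 25, "F", 1005, "A100", 2),
    (9, "Winston Smith", 45, "M", 1005, "A100", 2),
    (10, "Don Juan", 35, "M", 1005, "B100", 2),
    (11, "Bob Cratchit", 30, "M", 1004, "A100", 6),
    (12, "Peregrin Took", 20, "M", 1008, "B100", 4),
    (13, "Rosie Cotton", 20, "F", 1004, "A100", 6),
    (14, "Dolores Haze", 60, "F", 1006, "A100", 1),
    (15, "La Esmerelda", 30, "F", 1008, "B100", 4),
])
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?, ?)", [
    (1001, "CEO", "Executive", 10, 200000),
    (1002, "VP of Sales", "Executive", 9, 190000),
    (1003, "VP of Production", "Executive", 9, 150000),
    (1004, "Production Engineer", "Associate", 6, 80000),
    (1005, "Sales Associate", "Associate", 6, 80000),
    (1006, "HR Manager", "Management", 7, 100000),
    (1007, "Production Specialist", "Management", 7, 110000),
    (1008, "Quality Test Engineer", "Associate", 5, 70000),
])
conn.executemany("INSERT INTO location VALUES (?, ?, ?)", [
    ("A100", "Seattle", "Washington"),
    ("B100", "Portland", "Oregon"),
    ("C100", "San Francisco", "California"),
])
headcount = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(headcount)  # 15
```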


Subqueries

Suppose that you want to get a list of all the employees who report to John Smith.  If you know John Smith’s employee Id, then you could simply write:

SELECT * FROM employees WHERE manager_id = [Insert John’s employee Id here]

But in the real world, you’ll probably have no idea what John’s employee Id is.  This is where a subquery comes in handy.  A subquery is a query within a query.  If that sounded exciting, your instincts were correct; writing queries within queries is just as intense as watching Leonardo DiCaprio create dreams within dreams.  To conceptualize how this might be done, think about how you could break the problem down:
  • Problem:  You want to find all of the employees who report to John Smith.
    • Translation:  Find everyone whose manager_id = John Smith’s employee Id.  
    • This is the main query.
  • Problem:  You need to know John Smith’s employee Id in order to complete the main query.
    • Translation:  Find John Smith’s employee Id.  
    • This is the subquery.
Here is how the SQL would look:

SELECT * FROM employees
WHERE manager_id =
    (SELECT id FROM employees
    WHERE name = "John Smith");

I indented the subquery and put it in parentheses.  This subquery returns 1 value, which is used to filter manager_id in the main query.  You may notice when you run this query that John Smith is returned as his own manager.  If you want to filter him out of the results, you would specify that in the main query, not in the subquery, because you want the subquery to return his employee Id.  So it would become:

SELECT * FROM employees
WHERE manager_id =
    (SELECT id FROM employees
    WHERE name = "John Smith")
AND name != "John Smith";

As another example, suppose you want a list of Vice Presidents (employees with “VP” in their job titles).  Here is the SQL:

SELECT * FROM employees
WHERE job_code IN
    (SELECT job_code FROM jobs
    WHERE title LIKE "%VP%");

Notice that “IN” is used instead of an equals sign.  That is because there are 2 Vice Presidents, so the subquery returns 2 results.  If you used an equals sign, then only the VP whose job code matches the first job code returned by the subquery would show up.  The IN is used to show employees whose job codes are in the results returned from the subquery.  The % symbol is a wildcard value and is used in this case to represent anything that could come before or after VP.  If you just put a “VP%” then the query would look for anything that starts with VP.  Also, “%VP” looks for anything that ends with VP, so by boxing the VP in between the wildcard symbols, it is like saying “find VP somewhere in the job title.” 
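Here is a quick sketch of the three wildcard placements, run through Python’s sqlite3 module with a couple of made-up job titles (the Senior VP title is my invention, just to give the patterns something to differ on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (title TEXT);
INSERT INTO jobs VALUES ('VP of Sales'), ('Senior VP'), ('CEO');
""")

def titles_like(pattern):
    # % matches any run of characters (including none) on that side.
    rows = conn.execute(
        "SELECT title FROM jobs WHERE title LIKE ? ORDER BY title", (pattern,))
    return [title for (title,) in rows]

print(titles_like("%VP%"))  # ['Senior VP', 'VP of Sales']
print(titles_like("VP%"))   # ['VP of Sales']
print(titles_like("%VP"))   # ['Senior VP']
```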

You may have wondered why a join wasn’t used instead of a subquery.  Many times joins can be substituted for subqueries.  In fact, it is often preferable to use joins because it takes less time to process than a subquery.  Here is how the same query would look with a join instead of a subquery:

SELECT * FROM employees
JOIN jobs ON jobs.job_code = employees.job_code
WHERE jobs.title LIKE "%VP%";

The tradeoff is that you must specify which fields you want returned, because requesting them all returns all of the fields from employees and jobs. 

Data Aggregation

There are several aggregate functions in SQL, such as COUNT, MAX, MIN, SUM, and AVG.  They are very easy to use.  You just put them in the select statement, like so:

SELECT COUNT(id) FROM employees;

That will give you a count of how many employees there are.  Here’s another example that gives the average salary for employees in the “Associate” job category:

SELECT AVG(salary) FROM jobs WHERE category = "Associate";

What if you want to see the average salary for each job category?  You can use the GROUP BY clause to do this.  For example:

SELECT category, AVG(salary) FROM jobs GROUP BY category;

You can add fields to the GROUP BY clause using commas.  For example, to show the average salary by pay grade for each category, do this:

SELECT category, pay_grade, AVG(salary) FROM jobs GROUP BY category, pay_grade;

In the real world, it is best to be careful about how much detail you provide when reporting salary like this.  The last query pulls back average salaries for pay grades that contain only one person, so if you thought about it, you could probably discern who those people are and what they make. 

The GROUP BY statement is often coupled with a HAVING statement.  This is useful when you want to filter aggregated data, because you cannot explicitly aggregate data after a WHERE statement.  For example, a query with “WHERE COUNT(employees) > 10” is not allowed.  But you can insert a HAVING clause to filter results like that instead.  Here is a query that shows the names of employees who have 5 or more people reporting to them:

SELECT name FROM employees
WHERE id IN
    (SELECT manager_id FROM employees
    GROUP BY manager_id HAVING COUNT(id) >= 5);

Aliases and Self Joins

In the last example, you found the name of the employee who has 5 or more employees reporting to them.  Knowing their name is great, but what if you also want to know how many people report to them?  That requires a self join.  Yes, tables can be joined to themselves.  When you do this, each table needs to be renamed to keep track of which one is which.  You can do this by assigning aliases.  Aliases are assigned either by specifying an “AS [alias here]” or simply by putting the alias after a space after the table name, like “[table name here] [alias here]”. 

The self join is done the same way as a regular join.  To find the count of direct reports for a manager, the employees table has to be joined to itself, but on manager_id = id.  So the manager Id from the first employees table is joined to the id of the second employees table, because you want information from the second table about the manager from the first table.  That may sound confusing, but here is what it looks like:

SELECT a.name, COUNT(b.id)
FROM employees a
JOIN employees b ON b.manager_id = a.id
WHERE a.id IN
    (SELECT manager_id FROM employees
    GROUP BY manager_id HAVING COUNT(id) >= 5)
GROUP BY a.name;

As you can see, the manager name is pulled from the employees table that is renamed as “a”.  That manager name is derived from the same subquery as before – the one that finds the Ids of managers with 5 or more employees.  So the employee Ids from table a are managers (the subquery that filters table a makes sure that only employees who are managers are returned).

I showed you a query earlier that provided the counts of direct reports for a given manager.  Before, the count of reports was derived from the count of employee Ids whose manager Id equaled a certain value.  The same concept applies here.  It cannot be pulled from the “a” table however, because that one is pulling information about each manager as an employee.  You need to get information about the manager as a manager.  To do that, a second employees table (b) is used that is joined by its manager_id to the employee Id of table a.  That way, any information that comes from table b pertains to managers (because table a contains only managers).  This allows you to do a count of the Ids where the manager_id = the employee Id in table a. 

That last query was tough, so I will stop there.  Stay tuned for part 4, which will cover case statements, unions, and a few examples that will tie everything together. 

Wednesday, December 17, 2014

Hiring, House Hunting, and Dating: Making Decisions with Optimal Stopping Theory

There are so many important searches in life: jobs, homes, girlfriends or boyfriends, which college to go to…if only there were a mathematical formula to help us make important decisions.  Oh wait, there is!  It’s called the optimal stopping theory.  While it certainly won’t help you make the best decision all of the time, it can provide guidance.

The Secretary Problem

One example of how optimal stopping theory is applied is the secretary problem.  The secretary problem is a hypothetical situation in which you are searching for the perfect job candidate to fill a secretarial position.  In this scenario, there are a few assumptions made to keep the problem simple:

  • The number of candidates, n, is known.
  • The candidates are ranked best to worst, with no ties, but the ranks are unknown.
  • Only the best candidate is chosen.  Since ranks are unknown, you must guess who the best is.
  • The candidates are interviewed in a random order, and the decision to keep or reject is made immediately after the interview, before proceeding to the next.
  • Once a candidate is rejected, they can never be recalled.
  • Once a candidate is chosen, no more are interviewed.

If you were to pick by trial and error, you might notice that you can usually get pretty close to finding the best candidate by automatically rejecting a few, then picking the next one you find that is better than all the ones you’ve seen until then.  Unless you are unlucky enough to have the best one fall into the first few that you reject, this strategy usually works.  In fact, that is the general idea behind solving the secretary problem.

If you pick the kth candidate, then the minimum number of candidates you pass over is k-1 (because if you pick the first one, you aren’t passing over any).  Your choices are k, k+1, k+2, etc., until you reach the number of candidates, n.  Your goal is to maximize the probability of choosing the best candidate, which you could write as P(S(n,k)).

As a good start to solving this problem, you could look at a situation when n = 3.  The candidate ranks are therefore 1, 2, and 3.  Writing out the permutations when n = 3, it becomes clear that the optimal strategy is to pick k = 2.

1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1

Consider each row in the table above as a different scenario.  In the first row, the best candidate (the candidate with rank 1) shows up first.  If you pick the candidate that shows up first (in other words, you pick k = 1), then your odds of picking the highest ranked candidate are 2/6.  If you pick the candidate that shows up second, or k = 2, then your odds of picking a candidate whose rank is higher than the first one are 3/6.  If you pick the third candidate, you pick when k = 3, and your odds of picking a candidate who is better than the first two are 2/6.  So as I mentioned, the best strategy is to pick when k = 2.
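That hand count can also be checked by brute force.  Here is a short Python sketch, using my own encoding of the skip-the-first-(k-1) strategy (not code from any particular source):

```python
from fractions import Fraction
from itertools import permutations

def success_probability(n, k):
    """Chance that rejecting the first k-1 of n candidates, then taking the
    first one better than all of them, lands on the best (rank 1) candidate."""
    perms = list(permutations(range(1, n + 1)))
    wins = 0
    for order in perms:
        best_rejected = min(order[:k - 1], default=n + 1)  # lowest rank skipped
        # Take the first later candidate that beats every rejected one;
        # if none ever does, you are stuck with the last candidate.
        pick = next((rank for rank in order[k - 1:] if rank < best_rejected),
                    order[-1])
        wins += (pick == 1)
    return Fraction(wins, len(perms))

# Matches the counting above: picking second is best for n = 3.
print([str(success_probability(3, k)) for k in (1, 2, 3)])  # ['1/3', '1/2', '1/3']
```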

You could write out the 24 permutations for n = 4, the 120 for n = 5, and so on, finding the optimal kth choice for each number of candidates, but it would quickly become hard to do manually.  Let x be the position of the best candidate.  An equation to describe the probability of success, P(S(n,k)) (success meaning you picked the best candidate), could be:

P(S(n,k)) = 1/n                                          if k = 1
P(S(n,k)) = ((k-1)/n) * sum from x = k to n of 1/(x-1)   if k > 1

Reading through that: the probability of success is 1/n when you pick the first candidate (k = 1).  For any later pick, the probability of success is (k-1)/n times the sum of 1/(x-1) from x = k to n.  You can test this equation against the case when n = 3; you will get the same probabilities that you manually calculated before.  A proof of how it is derived can be found here.

Now the probability of success, P(S(n,k)), has to be optimized by determining what happens when the number of available candidates, n, gets infinitely large.  This requires setting the derivative of the bottom part of the piecewise function above equal to zero to find critical values, and finding the critical value that is the global maximum for the function.  At first it might seem intimidating to take the derivative of that bottom chunk, but it becomes easier when you notice that, for large n, it can be rewritten as a definite integral by substituting t for (k-1)/n and u for (x-1)/n:

P(S(n,k)) ≈ t * ∫ from t to 1 of (1/u) du

Integration rules give:

t * (ln(1) - ln(t)) = -t * ln(t)

The derivative of –t * ln(t) with respect to t is:  –ln(t) - 1.  Setting this equal to zero and solving for t gives:

-ln(t)-1 = 0
-ln(t) = 1
ln(t) = -1
t = 1/e

Plugging t = 1/e back into -t * ln(t) gives -(1/e) * ln(1/e) = 1/e, and since the second derivative, -1/t, is negative there, this critical value is the maximum of the function.  This means that when you have a large number of candidates, your odds of picking the best one are the highest when you let 1/e (or about 36.8%) of the candidates go by without picking any, and then pick the next candidate who is better than all of the others seen so far.  When you do this, your odds of picking the best one are also about 36.8%, which may seem low, but it is the highest percentage that you can hope to get.

In the real world, you probably never know the true number of candidates, n.  But you can mitigate your lack of such knowledge by setting restrictions.  For example, if you set the number of people you’re willing to date throughout your life to 200, then you can follow the 36.8% heuristic to find the best mate out of the 200.  It is interesting to think that letting 36.8% of your options go by is like establishing a baseline.  By that point, you will have a good idea of what is out there, and will then be in a position to make a fair judgment about what option is worth choosing.
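The 36.8% rule is easy to sanity-check with a quick Monte Carlo simulation.  Here is a sketch; the candidate count, trial count, and seed are arbitrary choices of mine:

```python
import math
import random

def pick_with_cutoff(candidates, cutoff):
    """Reject the first `cutoff` candidates, then take the first one
    that beats everything seen so far (or the last candidate if none does)."""
    baseline = max(candidates[:cutoff], default=float("-inf"))
    for value in candidates[cutoff:]:
        if value > baseline:
            return value
    return candidates[-1]

random.seed(42)
n, trials = 100, 20_000
cutoff = round(n / math.e)  # skip about 36.8% of the candidates
wins = 0
for _ in range(trials):
    candidates = random.sample(range(n), n)  # n distinct scores, random order
    wins += pick_with_cutoff(candidates, cutoff) == n - 1
print(wins / trials)  # close to 0.37
```

With n = 100 the theoretical success rate is about 37%, and the simulated rate lands close to that.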

Note that there are variations of the secretary problem that change the outcome.  For example, you could have group interviews, or the candidate you choose could reject your offer.  You can read more about those scenarios here.

Exploring Optimal Stopping Theory Computationally

I wrote a Python script to test the optimal stopping theory.  You can fork my GitHub repo for it.  The script generates a list of 1,000 random numbers between 1 and 100, stores the first 36.8% of them as a baseline, just as optimal stopping theory recommends, and then looks for the next number that is higher than all of the numbers before it.  Because the seed is set to allow reproducibility, you’ll get the same set of random numbers, and therefore the same result, no matter how many times you run the script.  In my test, the best value, 100, was encountered in the first 36.8% of the observations, so no later option could beat the baseline; even though there could have been other 100s after the first one, no number could ever exceed it.  You can change the seed to any other number and try it again; the number you pick doesn’t matter, it’s just used for repeatability.

For a seed of 10, plotting the numbers in rank order (ranks 1 to 1,000, running from the bottom to the top of the y-axis) shows that you run across very high numbers rather quickly, so the result makes sense.  After all, you would only expect to find the highest number about 36.8% of the time if the optimal stopping theory holds.

In this blog, I explained the theory of optimal stopping.  I proved why you maximize your odds of picking the best option out of a set of n total options if you let 36.8% of them go by without choosing, and then pick the next option that is better than all of the others you’ve seen until then.  Finally, I simulated a version of the secretary problem with Python, and showed that most of the time you probably won’t pick the best option.  So there is a little math that you can use to go out into the world and maximize your odds of making the best choices in life.

Saturday, November 15, 2014

How to Win at Rock-Paper-Scissors

What if I said that I could teach you how to win at rock-paper-scissors every single time?  No longer will you have to sit in the back seat, eat the smaller piece of cake, or be player two in a video game.  Does that sound too good to be true?  Maybe…but what if I said I could teach you how to win more times than your opponent?  That’s all you really need, and it seems more believable, right?  Well that’s exactly what I’m going to do!

When I was a young boy, I discovered that I could usually win at rock-paper-scissors if I played the hand that my opponent had just played and lost with.  The only requirement was that I win one round, and then I could go on winning indefinitely.  For some reason, most players who lost would choose the hand that would have beaten the hand that won the last round.  If they did not, they chose the third option, but they rarely ever replayed the hand they had just lost with.  They did not even realize they were doing it.

In psychology, this is called loss aversion.  A loss hurts twice as much as a win feels good.  Players who had just lost with paper, for example, would choose anything but paper in the next round.  So grabbing a win and then going with the hand that your opponent loses with is one strategy.  But how does that play out in terms of your odds of winning?

At first, your odds of winning are 1 in 3.  If both players picked hands at random, your odds would never change; in the long run, each of you would win approximately 33% of the time.  That is the Nash equilibrium for this game.  But people don’t behave randomly.  If you assume that my observation is correct, and that losers avoid the hand they just lost with, then replaying their losing hand raises your odds of winning to 1 in 2: they will pick either the hand that would have beaten the last round’s winner, which your new hand beats, or your own previous hand, which beats your new hand.  They’ll rarely pick the same losing hand.

If you lose, it’s better to play the part of the loser.  In other words, pick the hand that would have won the last round.  Winners typically stick with their winning choice, much like gamblers.  In terms of psychology, winning serves as positive reinforcement for choosing that hand.  So if you pick the hand that would have won, and assume that winners will not pick the hand that just lost due to loss aversion, then your odds of winning the next round increase to 1 in 2.

In summary, if your opponent is loss averse, then your best strategies are as follows:

1.    If you win, pick the hand that your opponent lost with for the next round.
2.    If you lose, pick the hand that would have beaten your opponent for the next round.
3.    If you tie, just pick randomly, because your odds of winning are still 1 in 3.
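Those three rules are easy to simulate.  Here is a sketch under my own modeling assumptions: the opponent never repeats a hand that just lost, and otherwise picks between the two remaining hands at random.

```python
import random

random.seed(0)
HANDS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
COUNTER = {loser: winner for winner, loser in BEATS.items()}        # hand that beats each hand

def play(me, them):
    if me == them:
        return "tie"
    return "win" if BEATS[me] == them else "loss"

my_wins = my_losses = 0
me, them = random.choice(HANDS), random.choice(HANDS)
for _ in range(100_000):
    result = play(me, them)
    my_wins += result == "win"
    my_losses += result == "loss"
    if result == "win":
        # Rule 1: replay their losing hand; the loss-averse loser avoids it.
        me, them = them, random.choice([h for h in HANDS if h != them])
    elif result == "loss":
        # Rule 2: play the hand that would have beaten their winning hand;
        # the winner avoids the hand I just lost with.
        me, them = COUNTER[them], random.choice([h for h in HANDS if h != me])
    else:
        # Rule 3: after a tie, both players pick at random.
        me, them = random.choice(HANDS), random.choice(HANDS)

print(my_wins > my_losses)  # True
```

Under these assumptions you end up winning noticeably more rounds than you lose, consistent with the 1-in-2 arguments above.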

BBC featured a story in May about rock-paper-scissors, and cited a Chinese study that corroborates what I’ve just written.  The BBC story states that winners stay with their winning hand about 33% of the time, and losers shift to the next hand in the rock-paper-scissors sequence (so they would go from rock to paper to scissors).  But contrary to the BBC article, it’s not that losers are picking the next hand in the sequence, it’s that they’re picking the hand that would have won the last round.  It just so happens that if you lose with rock, then the hand that would have won is scissors, and scissors is also the next hand in the sequence.  The same is true for losing with scissors and paper.

Of course, all of the strategies I’ve listed above are speculation.  They rely on the assumption that your opponent is unaware of these same ideas, and is just playing the hand that they feel could win.  If your opponent is aware of these ideas, it could change the game.  Or worse yet, your opponent could know about game theory, and maximize his/her odds of winning by picking hands at random, every single time.  If that is the case, then your best bet is to choose hands at random as well.  But no matter the scenario, I’ve shown how you can improve your odds of beating your opponent.  Now you can venture out into the world and win at rock-paper-scissors, and by doing so, you can also win at life.

Saturday, November 8, 2014

The Titanic Would Not Have Survived a Head on Collision with the Iceberg

One thing I’ve frequently heard about the Titanic that drives me crazy is the claim that if it had hit the iceberg head on, instead of at an angle, it would not have sunk.  That’s just absurd, and here is the math to prove it…

According to Google, the Titanic weighed 52,310 US tons.  Weight is technically a measure of force (mass times acceleration due to gravity), but ship tonnage is quoted as mass, so converting US tons to kilograms gives the Titanic’s mass:

52,310 tons * 907.18474 kilograms per ton = 47,454,833.7494 kg

The Titanic was traveling at about 22 knots when it struck the iceberg, or 11.3178 m/s.  For simplicity, assume that it came to a dead stop upon impact, and that it took 1 second for that to occur, which makes the magnitude of the deceleration numerically equal to the velocity.  Since force = mass * acceleration, the force exerted by the Titanic if it rammed the iceberg would have been:

Force =  47,454,833.7494 kg  * 11.3178 m/s = 537,084,317.41 N
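The arithmetic above can be double-checked in a few lines of Python, using the same rounded conversion figures as this post:

```python
# The same rounded figures as in this post.
mass_kg = 52310 * 907.18474   # US tons to kilograms
speed_ms = 11.3178            # 22 knots in meters per second
force_n = mass_kg * speed_ms  # F = m * a, with a 1-second dead stop assumed

print(round(mass_kg, 4))  # 47454833.7494
print(round(force_n, 2))  # 537084317.41
```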

According to Newton’s third law, every force has an equal and opposite reaction, so the impact would have exerted just as much force back on the Titanic.  The Titanic’s hull was made of carbon steel, which has a shear strength of about 74,000 psi.  To compare the impact force with that strength, assume the force was spread over one square meter of the bow, giving a pressure of:

537,084,317.41 N / 1 square meter = 537,084,317.41 Newtons per square meter

Since 1 Newton per square meter = 0.000145 psi, the pressure in psi would have been:

537,084,317.41 * 0.000145 = 77,877 psi

That exceeds the shear strength of the carbon steel hull, and in a real collision the force would have been concentrated on an area far smaller than a full square meter, multiplying the stress many times over.  The impact would have caused the steel to crumple and split into pieces, sending ripples down the entire length of the ship.  Everything inside would have been violently thrown forward, like in a car crash without seatbelts.  The ship would have burst apart at the seams and sunk quickly, and few on board would have survived.

For anyone who still harbors any doubt, consider the following…

Would the Titanic have broken the iceberg in half?

If the iceberg was anywhere from 200-400 feet long, as some sources suggest, it could have been at least as thick.  Trying to cut it in half with a ship would be like trying to cut a brick of ice apart with a butter knife.  It’s just not going to happen.  Furthermore, only about a tenth of an iceberg sits above the surface of the ocean.  If the iceberg that the Titanic hit was 100 feet high, as sources suggest, it could have extended many hundreds of feet below the surface.  Again, a ship so much smaller than the berg would not have stood a chance at splitting it apart.

Would the Titanic have pushed the iceberg out of the way?  

Momentum must be conserved in a collision.  A real impact would have dissipated energy by crumpling steel and shattering ice, so it would have been an inelastic collision, but momentum is conserved either way.  Assuming an iceberg mass of roughly 150,000,000 kg, and that the ship came to a dead stop, transferring all of its momentum to the iceberg, the new velocity of the iceberg would have been:

mt1vt1 + mi1vi1 = mt2vt2 + mi2vi2

( 47,454,833.7494 kg * 11.3178 m/s ) + ( 150,000,000 kg * 0 m/s ) = ( 47,454,833.7494 kg * 0 m/s ) + ( 150,000,000 kg * x m/s )

537084317.4 kg * m/s = 150000000 kg * x m/s

x = 3.5806 m/s

So after the Titanic hit, the iceberg could theoretically have moved off at 3.5806 m/s, or about 8 mph, slowing quickly due to the drag of the ocean water.  However, since the majority of an iceberg lies beneath the water, the Titanic would have struck the top part of it.  Like a bobblehead doll, it more than likely would have wobbled rather than moved.  For an object of that size, any movement would have been barely noticeable.  So there is no way the Titanic could have pushed it out of the way.
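The momentum arithmetic checks out in code as well, using the same assumed iceberg mass of 150,000,000 kg:

```python
# Momentum conservation with the figures used in this post; the iceberg
# mass of 150,000,000 kg is an assumed value, not a measured one.
ship_mass = 47454833.7494   # kg
ship_speed = 11.3178        # m/s (about 22 knots)
iceberg_mass = 150_000_000  # kg

iceberg_speed = ship_mass * ship_speed / iceberg_mass
print(round(iceberg_speed, 4))            # 3.5806
print(round(iceberg_speed * 2.23694, 1))  # 8.0 (mph)
```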

Saturday, September 13, 2014

A BI Analyst’s Review of OBIEE (Oracle Business Intelligence Suite)

Oracle Business Intelligence Suite Enterprise Edition (OBIEE) is Oracle’s premier business intelligence platform.  I wrote a review for it on a software review site that I decided to share in this blog.  So when you’re done reading, head on over there to read or write reviews for other BI platforms. 

A Comprehensive Business Intelligence Suite
Overall Rating:  4/5

I am a business intelligence analyst who has used Oracle Business Intelligence Enterprise Edition (OBIEE) and Oracle Enterprise Performance Management (Hyperion EPM) for BI reporting. As a power user, I cannot speak to the cost or ease of system integration of Oracle Business Intelligence, but I can attest that its features are leaps and bounds beyond its predecessor’s, and it is easy to use once you learn how.

OBIEE is built upon a repository with 3 layers of metadata. The physical layer (bottom layer) defines the data sources, primary and foreign keys, and connections. The business model mapping layer (middle layer) is where the business logic, aggregation, and structuring occur. The presentation layer (top layer) is where data is arranged and presented to the end users. The presentation layer is also where security roles are assigned. Since most users will only see data from this presentation layer, it helps standardize organizational reporting so that bad queries do not result in different teams of people ending up with different data. 

Layering allows OBIEE to connect to multiple data sources and combine them all in the presentation layer. For example, you can build a report with connections to a relational table and an Essbase cube. You read that right: you could build a report using a seamlessly integrated combination of SQL and MDX, without knowing how to write either one!

OBIEE allows alerts and agents to be set up for scheduled processes. Agents are like advanced alerts that can trigger other processes depending on whether or not certain conditions are met. The automation that this feature allows is tremendous.

Mobile Application Developer is another product in the suite that allows users to build HTML5 applications that can be viewed on any mobile device. The interface reminds me of Microsoft PowerPoint, and you simply drag and drop charts, tables, filters, etc. onto the screen. Once the app is built, you can scan the QR code to view it on your phone. Since the apps are built with HTML, they can also be viewed in a web browser. 

Overall, I would rate Oracle Business Intelligence as an excellent product.

Sunday, July 13, 2014

How Machine Learning is Changing Artificial Intelligence

I recently saw news headlines about the Terminator reboot and Stephen Hawking warning about uncontrollable artificial intelligence, and it got me thinking: how close are we to having computers that can think for themselves?

I read an article on BBC in which the author, Tom Chatfield, argued that AI has lost its way.  He compared AI as it was thought of in 1956 to AI today: in 1956, the future of AI was thought to be in creating thinking computers, whereas today, the future of AI is in making computers analyze large datasets and answer problems based on the probability of an answer being correct.  Computers that have beaten humans at games, like Watson at Jeopardy! and Deep Blue at chess, have done so by analyzing data, not by knowing anything about the game they are playing.  The process of analyzing data to learn an answer is called machine learning.  So is Tom correct?  Do computers that “think” using machine learning have false artificial intelligence?

Today’s computers still cannot pass the Turing Test, a test that is used to measure AI.  The test is carried out by placing a human and a computer in separate locations, and the human asks the computer questions.  The human is not told whether the answers to his/her questions are coming from another human or a computer.  If the human thinks that their conversational partner is a human, when it is actually a computer, then the computer will have passed the Turing Test.  If a computer can pass that test, then it will be smart enough to be considered intelligent.  Today’s computers are not yet intelligent by a Turing standard, but they are getting closer using machine learning.  So how could machine learning be the wrong direction in the field of AI?

Tom Chatfield references Douglas Hofstadter’s comments about AI, in which Hofstadter states that intelligence is “inappropriate for describing insights drawn by brute computing power from massive data sets”.  Hofstadter argues that fancy programming has nothing to do with how intelligence arises in the human brain.  In my opinion, that is not accurate at all.  In machine learning, a computer draws a conclusion based on comparing new information to historical data.  How is that any different from what humans do?  When we face a new situation, we compare it to what we already know too.  The difference between computers and humans is that when there is no historical data to compare to, humans can still make a good guess, whereas computers cannot.

I think the reason for that difference is that humans can easily break down new information into parts, whereas computers struggle to do so.  For example, if you had no knowledge of soccer, but you knew everything about every other sport, and you were put into a room with nothing but a soccer ball and a net, how would you figure out what to do?  You could start by comparing the ball to other balls in sports.  You have nothing to strike the ball with, so you would probably handle it manually.  It’s about the size of a basketball, so it could be tossed.  Then you look at the net and notice it is much larger than a basketball net.  Since basketball nets are small, they require precision.  It would be too easy to toss a ball into the huge net in front of you.  Furthermore, the net touches the floor.  Maybe the larger net in front of you requires less precision.  Maybe the fact that it touches the floor means the ball can be rolled into it.  Maybe you could kick the ball into the net…  Do you see how that thought process works?  You break down the scenario into pieces and compare the pieces to your existing knowledge, rather than trying to match the scenario as a whole to an existing one.  To do that programmatically on a computer is difficult, but it’s almost effortless for human brains.

Imagine a toddler learning to walk, talk, and do whatever.  They have no historical data or experience to compare new information to, so they learn everything by trial and error, and by observing other people.  Anything they learn in the process is stored in their “knowledge base” until proven wrong.  Why couldn’t computers do the same thing?  Perhaps machine learning is not the wrong approach to AI, but an incomplete or evolving one.  To my knowledge, there are no machine learning techniques that entail starting with no historical data and learning everything from scratch.  What would a machine learning technique like that look like?  Consider the following thought experiment…

Suppose you want to create a computer that is capable of learning English, so that it could read a book and know whether or not the book was written in English.  How would you do it?

You could start by defining the English alphabet.  The computer could then loop through each letter of every word in the book, and for each letter, loop through the known alphabet.  If all the letters in that book are contained within the alphabet, then the book is written in English.
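That looping check can be sketched in a few lines of Python.  The function name and the set-based shortcut (instead of nested loops) are my own choices; the logic is the same membership test described above.

```python
import string

# The known English alphabet, as a set for fast membership tests.
ENGLISH_ALPHABET = set(string.ascii_lowercase)

def is_english(book_text):
    """Return True if every letter in the text belongs to the known alphabet."""
    letters = {ch for ch in book_text.lower() if ch.isalpha()}
    return letters <= ENGLISH_ALPHABET  # subset test: all letters are known

print(is_english("The quick brown fox"))  # True
print(is_english("La niña come"))         # False ('ñ' is not in the alphabet)
```

Checking set membership this way is equivalent to the double loop described above, just faster.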

This program would work fine if all the letters of the English alphabet were already known.  But suppose you only knew of letters A-W, and had no idea that letters X, Y, and Z existed.  Your program could be modified so that if it found letters that weren’t in the existing alphabet, it would add them to the alphabet, and assume the new alphabet is correct.

Now suppose that the Spanish word niña is in the book.  If your program added ñ to the English alphabet, that would be incorrect.  You could modify your program to evaluate the alphabet at the end of the book.  If any letters are used fewer than x number of times, they would be thrown out of the English alphabet.  You could improve performance further by reading multiple books.  With each new book, the likelihood of your alphabet being the correct English alphabet would increase.  Of course, you would have to move the evaluation part of your program to the very end, otherwise improvements made by reading each new book would be lost.  This is the same way humans learn – assume new information is correct until proven wrong.  To be fair, you could do the reverse – assume new information is wrong until proven correct.  If you did it that way, you would not add letters to the alphabet until you saw them x number of times.
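The frequency-threshold version of the program could look something like this.  It is only a sketch of the idea above; the function name, the sample “books”, and the threshold value are illustrative.

```python
from collections import Counter

def learn_alphabet(books, min_count=2):
    """Learn an alphabet from a collection of texts.

    Every letter seen is provisionally added to the alphabet; at the very
    end, letters seen fewer than min_count times overall are thrown out,
    which prunes rare foreign letters like the 'ñ' in 'niña'.
    """
    counts = Counter()
    for text in books:
        counts.update(ch for ch in text.lower() if ch.isalpha())
    return {ch for ch, n in counts.items() if n >= min_count}

books = ["abba cab", "la niña", "bad cab"]
print(sorted(learn_alphabet(books, min_count=2)))  # ['a', 'b', 'c']
```

Note that the pruning happens once, after all the books are read, for the reason given above: evaluating after each book would throw away letters that a later book might confirm.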

Computers that could learn without having historical data to compare to, or computers that could easily break new information into pieces that could be compared to historical data would be incredibly powerful.  In some cases, they could have perfect data.  For example, a computer making stock trades could have the entire history of the stock market to compare to; something no human brain could do.  If it did have that history, and if it could accurately assimilate new events, then could you say that the computer would be the most experienced trader on Wall Street?  I guess we’ll have to see what the future brings.