Monday, February 16, 2015

How to Make Sparklines with d3.js

A sparkline is a tiny line graph.  Why would you ever want a tiny line graph?  Edward Tufte, the creator of the sparkline, summarizes it best on his website:

“Sparklines mean that graphics are no longer cartoonish special occasions with captions and boxes, but rather sparkline graphics can be everywhere a word or number can be: embedded in a sentence, table, headline, map, spreadsheet, graphic.”
--Edward Tufte

So sparklines are useful for quickly summarizing trending data.  They aren’t needlessly flashy or distracting, and they can be put anywhere to provide context for numbers and figures.  They are especially useful in BI dashboards, where space is a luxury and you need to convey the meaning of a number.  The current quarter’s revenue is a good example.  Is the current quarter’s revenue good or bad?  You could make the number green to indicate that it has increased, or red for a decrease, or you could make a full-fledged line graph showing revenue history, or even a table of values.  But a sparkline is simpler than any of those, and just as effective.

Sparklines are incredibly easy to make with d3.js.  They’re simply miniature line graphs.  Here is an example:

[Sparkline: Apple’s quarterly closing stock price, Q1 2003 to Q2 2013]
That is what I will explain how to make below.  It is a line showing the quarterly stock price for Apple (AAPL) from Q1 2003 to Q2 2013.  I have used the underlying data in a prior blog about chart types with d3.js, so I simply stripped out the code for the line graph and made minor edits to produce this sparkline.  I’ll omit the JSON to keep the code short, but you can link to it from my Gist, or copy and hardcode it.
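For reference, each object in the dataset looks roughly like this; the values below are made up for illustration, so grab the real numbers from the Gist:

//Hypothetical sample rows; the real data is in the Gist
var dataset = [
    { Quarter: "Q1", Year: "2003", StockClose: 7.07 },
    { Quarter: "Q2", Year: "2003", StockClose: 8.98 }
    //...one object per quarter through Q2 2013
];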

To start off, set the width and height of the sparkline to your liking:

var width = 100,
    height = 30;

To simplify the encoding of time in the graph, I concatenated quarter and year for each object in the dataset, and stored it as a new property of the object.

//Add a quarter-year property to each object in the dataset
dataset.forEach(function(d) {
    d.QYear = d.Quarter + " " + d.Year;
});

This is the standard d3 code to append an SVG to a web document.  In this case, I appended to the document body, but you could also create a div element and append to that instead.  The .append("g") call adds a group element inside the SVG, so that any transformation you apply to the group applies to everything in it.

//Append an SVG to the document body
var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height)
  .append("g");

Next I created the x and y scales.  Since the x-axis is time, and is represented by ordinal data, it uses an ordinal scale.  The domain is the input, which in this case is a function of the data, and in that function I am returning the QYear property of the object, which I created above.  D3 automatically maps the domain to a range, or output.  The rangeRoundBands function makes sure that the intervals are equally subdivided and include padding. 

var x = d3.scale.ordinal()
    .domain(dataset.map(function(d) { return d.QYear; }))
    .rangeRoundBands([0, width], .04);

The y-axis is the stock price, which is numerical data, so a linear scale is used.  The domain is a function of the data, and in that function I am returning the StockClose property, which holds the closing stock price on the first day of the quarter.  The range maps the input domain to an output range running from the chart height down to 0.  Since the origin in SVG coordinates is at the top left of the page, the range is entered backwards.  If you were to enter [0, height], your graph would be upside down, because D3 would interpret that as starting the graph at SVG coordinate 0 and ending at the coordinate equal to your height variable.

var y = d3.scale.linear()
    .domain([0, d3.max(dataset, function(d) { return d.StockClose; })])
    .range([height, 0]);

The next function draws the line.  The interpolate method smooths your line out.  It is optional, but it works well with sparklines because they are so small.  Without interpolation, the sparkline would look pixelated, like the edge of an image in MS Paint when you zoom in all the way and use the blocky pencil tool to color things.  The x accessor is a function of the data, d, which returns the x scale function with d.QYear as the argument.  The y accessor is a function of the data as well, and it returns the y scale function with d.StockClose as the argument.  In these 2 lines, you are essentially applying the scale functions that you built above to properties of the data.

//Draw the line graph
var line = d3.svg.line()
    .interpolate("basis")
    .x(function(d) { return x(d.QYear); })
    .y(function(d) { return y(d.StockClose); });

This section appends a path to the SVG.  You just built the line function; now you append a path, passing your dataset as the argument of the line function.  I gave it a class of linePath, but since there is no CSS, you could omit the class assignment.  The stroke attribute gives the line its color, and the stroke-width gives it a width, just like in CSS.

//Append a path to the SVG
svg.append("path")
    .attr("d", line(dataset))
    .attr("class", "linePath")
    .attr("fill", "none")
    .attr("stroke", "#FF0000")
    .attr("stroke-width", 1);

So after combining all of those pieces together, you get the full code:

var width = 100,
    height = 30;

//Add a quarter-year property to each object in the dataset
dataset.forEach(function(d) {
    d.QYear = d.Quarter + " " + d.Year;
});

//Append an SVG to the document body
var svg = d3.select("body").append("svg")
    .attr("width", width)
    .attr("height", height)
  .append("g");

var x = d3.scale.ordinal()
    .domain(dataset.map(function(d) { return d.QYear; }))
    .rangeRoundBands([0, width], .04);

var y = d3.scale.linear()
    .domain([0, d3.max(dataset, function(d) { return d.StockClose; })])
    .range([height, 0]);

//Draw the line graph
var line = d3.svg.line()
    .interpolate("basis")
    .x(function(d) { return x(d.QYear); })
    .y(function(d) { return y(d.StockClose); });

//Append a path to the SVG
svg.append("path")
    .attr("d", line(dataset))
    .attr("class", "linePath")
    .attr("fill", "none")
    .attr("stroke", "#FF0000")
    .attr("stroke-width", 1);

Don’t forget to reference d3.js somewhere in your document.  If you want to see it in action, here’s a JSFiddle.
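For example, this script tag loads version 3 of the library (the API used above) from the official d3js.org host:

<script src="https://d3js.org/d3.v3.min.js" charset="utf-8"></script>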

Monday, January 19, 2015

Learn Basic SQL Part 4 – Case Statements and Unions

The first 3 parts of these tutorials have covered just about all of the basics of SQL.  There are a large variety of functions out there, but the ones I have shown are arguably the most common and will get you through most situations.  In part 4, I will explain case statements and unions, which are two final topics that I think would be helpful for SQL beginners.  At the end, I’ll list resources for further exploration. 

Case Statements

Case statements test an input against one or more conditions, and when a condition is met, a specified value is returned.  For example, you could use a case statement to return “Rich” when a person’s salary is above a given threshold, or “Poor” if it is not.  Using the same database as part 3, that would be:

SELECT a.name, b.salary,
    CASE
    WHEN b.salary > 100000 THEN "Rich"
    ELSE "Poor"
    END
FROM employees a
JOIN jobs b ON b.job_code = a.job_code;

Notice that case statements end with END.  If there were multiple WHEN clauses, they would not be separated by commas.  Also note that these examples wrap string literals in double quotes, which SQLite tolerates, although standard SQL uses single quotes.

Case statements can also be used to get tallies of how often an event occurs.  You can do this by returning 1 when a condition is met and 0 otherwise, summing the whole case statement, and then filtering with a HAVING clause.  For example, here is the SQL to get a count of production engineers who report to Candide Optimist:

SELECT b.name,
    SUM(CASE
    WHEN (a.manager_id =
        (SELECT id FROM employees
        WHERE name = "Candide Optimist")
    AND a.job_code =
        (SELECT job_code FROM jobs
        WHERE title LIKE "%Production Engineer%"))
    THEN 1
    ELSE 0
    END) count_of_engineers
FROM employees a
JOIN employees b ON b.id = a.manager_id
GROUP BY b.name HAVING count_of_engineers > 0;

I guarantee there are easier ways to get a count of production engineers working for a given manager, but I wanted to show how to do it using a summed case statement. 

Unions

Unions are used to stack records in a database.  Since they combine rows of data, they can only be used when the queries being combined return the same number of columns.  A good example of when this would be useful can be found in w3schools’ tutorial: combining a list of customer addresses with a list of supplier addresses.  Maybe you want to know all of the cities where your company has operations.  A union query would allow you to stack the datasets and report on the results.  There is no need for one in the database I created for these tutorials, but the general shape is sketched below.
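This sketch uses two hypothetical tables like the ones in the w3schools example; note that UNION removes duplicate rows, while UNION ALL keeps them:

-- Hypothetical customers and suppliers tables, as in the w3schools tutorial
SELECT city FROM customers
UNION
SELECT city FROM suppliers;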

Query Examples

1.  Which city has the highest average salary, and what is that salary?

SELECT city, MAX(average_salary)
FROM (
  SELECT c.city, AVG(b.salary) AS average_salary
  FROM employees a
  JOIN jobs b ON b.job_code = a.job_code
  JOIN location c ON c.location = a.location
  GROUP BY c.city
);

For this task, it is easiest to get the average salary by city first, then select the city with the highest average salary from that.  In order to join salary information from the jobs table with the city names in the location table, the employees table had to be used to join both of them.  It is acceptable to use a third table to join on, even if no fields are being pulled from that third table. 

2.  Is there a difference between the salaries of men and women who have the same job code?

SELECT a.job_code, a.sex, AVG(b.salary)
FROM employees a
JOIN jobs b ON b.job_code = a.job_code
WHERE a.job_code IN
    (SELECT job_code FROM employees
     GROUP BY job_code HAVING COUNT(DISTINCT sex) > 1)
GROUP BY a.job_code, a.sex;

For this task, you only want to see job codes that contain both men and women.  The subquery ensures that you do, by pulling back job codes where the count of distinct instances of sex is greater than 1 (because if a job code had men and women in it, then there would be 2 distinct results).  Looking through the results, you can see that in every job code where there were men and women, the salaries are the same.  That makes sense, because salary is stored on the jobs table by job code, so everyone with the same job code earns the same salary.

3.  Which city has the highest percentage of employees under 30?

SELECT city, MAX(percent_under_30)
FROM
(
SELECT c.city,
    SUM(CASE
        WHEN a.age < 30
        THEN 1.0
        ELSE 0.0
        END)
    / COUNT(a.id) * 100 AS percent_under_30
FROM employees a
JOIN location c ON c.location = a.location
GROUP BY c.city
);

The key here, as with most tasks that require subqueries, is to start from the inside out.  The inner query simply pulls the percentage of employees under 30 for each city.  I found it easiest to start simple: build a query that pulls back a headcount of employees under 30 and a total headcount for each location, then figure out how to divide them, then put that into the subquery, and finally pull in the city name.  Note that since this query calculates a percentage as a real number, you must use decimal points in your case statement to force SQLite to calculate the answer as a real number.  Otherwise SQLite will convert the numbers to integers before dividing, and if you ran the inner query on its own, you would get zeros for the percentages of people under 30 for 2 of the 3 cities.

SQL Resources

SQL Fiddle is a free online IDE for writing SQL.  You build a database on the left, and write queries on the right.  You could have done all 4 of my tutorials using this instead of SQLite. 

More tutorials about SQL can be found at w3schools.com/sql and sqlzoo.net.  I found SQLZOO’s exercises to be much better, but they don’t provide a lot of instructions. 

Tuesday, January 6, 2015

Learn Basic SQL Part 3 – Subqueries and Aggregate Data

In part 1 of these basic SQL blogs, I showed how to create databases and tables, and how to manipulate those tables.  In part 2, I showed how to update or add data to tables, and how to query those tables.  At the end of part 2, I described how tables can be joined.  Now in part 3, I’m going to explain how and why you can build subqueries, as well as how data can be aggregated or summarized.  The first block of code below will build a database with a few tables of data.  It will serve as a review of parts 1 and 2, and add complexity to the database that will be needed for subquerying and aggregating. 

C:\>cd sqlite
C:\sqlite>sqlite3
sqlite>

In previous blogs, I have included the “sqlite>” and “…>” that start every line within the program, because I wanted to comfort beginners by showing them exactly what they would see on their screens.  But from now on I will omit it to keep the code more concise.  (Note that in order to paste the code below into your command line, you have to right click and choose paste.)

.open myCompany.db

CREATE TABLE employees
    (
    id integer primary key,
    name varchar(25),
    age integer,
    sex varchar(1),
    job_code integer,
    location varchar(5),
    manager_id integer
    );

CREATE TABLE jobs
    (
    job_code integer primary key,
    title varchar(30),
    category varchar(20),
    pay_grade integer,
    salary real
    );

CREATE TABLE location
    (
    location varchar(5) primary key,
    city varchar(25),
    state varchar(25)
    );

INSERT INTO employees
    (id, name, age, sex, job_code, location, manager_id)
    VALUES
    (1, "John Smith", 40, "M", 1001, "A100", 1),
    (2, "Jane Smith", 40, "F", 1002, "A100", 1),
    (3, "Alice Wonderland", 20, "F", 1005, "A100", 2),
    (4, "Nick Carroway", 35, "M", 1007, "B100", 6),
    (5, "Blanche DuBois", 30, "F", 1004, "B100", 6),
    (6, "Candide Optimist", 55, "M", 1003, "B100", 1),
    (7, "Quentin Compson", 25, "M", 1004, "C100", 6),
    (8, "April Wheeler", 25, "F", 1005, "A100", 2),
    (9, "Winston Smith", 45, "M", 1005, "A100", 2),
    (10, "Don Juan", 35, "M", 1005, "B100", 2),
    (11, "Bob Cratchit", 30, "M", 1004, "A100", 6),
    (12, "Peregrin Took", 20, "M", 1008, "B100", 4),
    (13, "Rosie Cotton", 20, "F", 1004, "A100", 6),
    (14, "Dolores Haze", 60, "F", 1006, "A100", 1),
    (15, "La Esmerelda", 30, "F", 1008, "B100", 4);

INSERT INTO jobs
    (job_code, title, category, pay_grade, salary)
    VALUES
    (1001, "CEO", "Executive", 10, 200000),
    (1002, "VP of Sales", "Executive", 9, 190000),
    (1003, "VP of Production", "Executive", 9, 150000),
    (1004, "Production Engineer", "Associate", 6, 80000),
    (1005, "Sales Associate", "Associate", 6, 80000),
    (1006, "HR Manager", "Management", 7, 100000),
    (1007, "Production Specialist", "Management", 7, 110000),
    (1008, "Quality Test Engineer", "Associate", 5, 70000);

INSERT INTO location
    (location, city, state)
    VALUES
    ("A100", "Seattle", "Washington"),
    ("B100", "Portland", "Oregon"),
    ("C100", "San Francisco”, "California");

As you can see, there is nothing new so far, but the additional data and tables will allow you to do more with your queries. 

Subqueries

Suppose that you want to get a list of all the employees who report to John Smith.  If you know John Smith’s employee Id, then you could simply write:

SELECT * FROM employees WHERE manager_id = [Insert John’s employee Id here]

But in the real world, you’ll probably have no idea what John’s employee Id is.  This is where a subquery comes in handy.  A subquery is a query within a query.  If that sounded exciting, your instincts were correct; writing queries within queries is just as intense as watching Leonardo DiCaprio create dreams within dreams.  To conceptualize how this might be done, think about how you could break the problem down:
  • Problem:  You want to find all of the employees who report to John Smith.
    • Translation:  Find everyone whose manager_id = John Smith’s employee Id.  
    • This is the main query.
  • Problem:  You need to know John Smith’s employee Id in order to complete the main query.
    • Translation:  Find John Smith’s employee Id.  
    • This is the subquery.
Here is how the SQL would look:

SELECT * FROM employees
WHERE manager_id =
    (SELECT id FROM employees
    WHERE name = "John Smith");

I indented the subquery and put it in parentheses.  This subquery returns 1 value, which is used to filter manager_id in the main query.  You may notice when you run this query that John Smith is returned as his own manager.  If you want to filter him out of the results, you would specify that in the main query, not in the subquery, because you want the subquery to return his employee Id.  So it would become:

SELECT * FROM employees
WHERE manager_id =
    (SELECT id FROM employees
    WHERE name = "John Smith")
AND name != "John Smith";

As another example, suppose you want a list of Vice Presidents (employees with “VP” in their job titles).  Here is the SQL:

SELECT * FROM employees
WHERE job_code IN
    (SELECT job_code FROM jobs
    WHERE title LIKE "%VP%");

Notice that “IN” is used instead of an equals sign.  That is because there are 2 Vice Presidents, so the subquery returns 2 results.  If you used an equals sign, then only the VP whose job code matches the first job code returned by the subquery would show up.  The IN is used to show employees whose job codes are in the results returned from the subquery.  The % symbol is a wildcard value and is used in this case to represent anything that could come before or after VP.  If you just put a “VP%” then the query would look for anything that starts with VP.  Also, “%VP” looks for anything that ends with VP, so by boxing the VP in between the wildcard symbols, it is like saying “find VP somewhere in the job title.” 

You may have wondered why a join wasn’t used instead of a subquery.  Many times joins can be substituted for subqueries.  In fact, it is often preferable to use joins because it takes less time to process than a subquery.  Here is how the same query would look with a join instead of a subquery:

SELECT * FROM employees
JOIN jobs ON jobs.job_code = employees.job_code
WHERE jobs.title LIKE "%VP%";

The tradeoff is that you must specify which fields you want returned, because requesting them all returns all of the fields from employees and jobs. 
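For example, to mimic the subquery version’s output, you could qualify the star so that only the employee columns come back:

SELECT employees.*
FROM employees
JOIN jobs ON jobs.job_code = employees.job_code
WHERE jobs.title LIKE "%VP%";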

Data Aggregation

There are several aggregate functions in SQL, such as COUNT, MAX, MIN, SUM, and AVG.  They are very easy to use.  You just put them in the select statement, like so:

SELECT COUNT(id) FROM employees;

That will give you a count of how many employees there are.  Here’s another example that gives the average salary for employees in the “Associate” job category:

SELECT AVG(salary) FROM jobs WHERE category = "Associate";

What if you want to see the average salary for each job category?  You can use the GROUP BY clause to do this.  For example:

SELECT category, AVG(salary) FROM jobs GROUP BY category;

You can add fields to the GROUP BY clause using commas.  For example, to show the average salary by pay grade for each category, do this:

SELECT category, pay_grade, AVG(salary) FROM jobs GROUP BY category, pay_grade;

In the real world, it is best to be careful how detailed you get when providing salary data like this.  The last query pulls back salaries for pay grades that contain only one person.  So if you thought about it, you could probably discern who those people are and what they make.

The GROUP BY clause is often coupled with a HAVING clause.  This is useful when you want to filter aggregated data, because you cannot use an aggregate function in a WHERE clause.  For example, a query with “WHERE COUNT(employees) > 10” is not allowed.  But you can use a HAVING clause to filter results like that instead.  Here is a query that shows the names of employees who have 5 or more people reporting to them:

SELECT name FROM employees
WHERE id IN
(SELECT manager_id FROM employees
GROUP BY manager_id HAVING COUNT(id) >= 5);

Aliases and Self Joins

In the last example, you found the name of the employee who has 5 or more people reporting to them.  Knowing their name is great, but what if you also want to know how many people report to them?  That requires a self join.  Yes, tables can be joined to themselves.  When you do this, each table needs to be renamed to keep track of which one is which.  You can do this by assigning aliases.  Aliases are assigned either by specifying “AS [alias here]” or simply by putting the alias right after the table name, like “[table name here] [alias here]”.
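For example, these two queries are equivalent; both alias the employees table as a:

SELECT a.name FROM employees AS a;
SELECT a.name FROM employees a;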

The self join is done the same way as a regular join.  To find the count of direct reports for a manager, the employees table has to be joined to itself, but on manager_id = id.  So the manager Id from the first employees table is joined to the id of the second employees table, because you want information from the second table about the manager from the first table.  That may sound confusing, but here is what it looks like:

SELECT a.name, COUNT(b.id) FROM employees a
JOIN employees b ON b.manager_id = a.id
WHERE a.id IN
    (SELECT manager_id FROM employees
    GROUP BY manager_id HAVING COUNT(id) >= 5)
GROUP BY a.name;

As you can see, the manager name is pulled from the employees table that is renamed as “a”.  The managers are identified by the same subquery as before, the one that finds the Ids of managers with 5 or more employees.  So the employee Ids from table a all belong to managers (the subquery that filters table a makes sure that only employees who are managers are returned).  The GROUP BY a.name at the end gives you one row, and one count, per manager.

I showed you a query earlier that provided the counts of direct reports for a given manager.  Before, the count of reports was derived from the count of employee Ids whose manager Id equaled a certain value.  The same concept applies here.  It cannot be pulled from the “a” table however, because that one is pulling information about each manager as an employee.  You need to get information about the manager as a manager.  To do that, a second employees table (b) is used that is joined by its manager_id to the employee Id of table a.  That way, any information that comes from table b pertains to managers (because table a contains only managers).  This allows you to do a count of the Ids where the manager_id = the employee Id in table a. 

That last query was tough, so I will stop there.  Stay tuned for part 4, which will cover case statements, unions, and a few examples that will tie everything together. 

Wednesday, December 17, 2014

Hiring, House Hunting, and Dating: Making Decisions with Optimal Stopping Theory

There are so many important searches in life: jobs, homes, girlfriends or boyfriends, which college to go to…if only there were a mathematical formula to help us make important decisions.  Oh wait, there is!  It’s called the optimal stopping theory.  While it certainly won’t help you make the best decision all of the time, it can provide guidance.

The Secretary Problem

One example of how optimal stopping theory is applied is the secretary problem.  The secretary problem is a hypothetical situation in which you are searching for the perfect job candidate to fill a secretarial position.  In this scenario, there are a few assumptions made to keep the problem simple:

  • The number of candidates, n, is known.
  • The candidates are ranked best to worst, with no ties, but the ranks are unknown.
  • You succeed only if you choose the best candidate.  Since ranks are unknown, you must guess who the best is.
  • The candidates are interviewed in a random order, and the decision to keep or reject is made immediately after the interview, before proceeding to the next.
  • Once a candidate is rejected, they can never be recalled.
  • Once a candidate is chosen, no more are interviewed.

If you were to pick by trial and error, you might notice that you can usually get pretty close to finding the best candidate by automatically rejecting a few, then picking the next one you find that is better than all the ones you’ve seen until then.  Unless you are unlucky enough to have the best one fall into the first few that you reject, this strategy usually works.  In fact, that is the general idea behind solving the secretary problem.

If you pick the kth candidate, then the number of candidates you pass over is k - 1 (if you pick the first one, you aren’t passing over any).  Your possible picks are positions k, k+1, k+2, etc., up to the number of candidates, n.  Your goal is to optimize the probability of choosing the best candidate, which you could write as P(S(n,k)).

As a good start to solving this problem, you could look at a situation when n = 3.  The candidate ranks are therefore 1, 2, and 3.  Writing out the permutations when n = 3, it becomes clear that the optimal strategy is to pick k = 2.

1 2 3
1 3 2
2 1 3
2 3 1
3 1 2
3 2 1

Consider each row in the table above as a different scenario.  In the first row, the best candidate (the candidate with rank 1) shows up first.  If you always pick the candidate that shows up first (in other words, you pick k = 1), then your odds of picking the highest ranked candidate are 2/6.  If you skip the first candidate and pick the next one who is better than everyone seen so far (k = 2), you succeed in 3 of the 6 scenarios, so your odds are 3/6.  If you skip the first two candidates (k = 3), your odds of ending up with the best one are 2/6.  So as I mentioned, the best strategy is to pick when k = 2.
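Before moving to the general formula, here is a short brute-force sketch in Python that scores every strategy for small n (illustrative, not efficient):

from itertools import permutations

def success_rate(n, k):
    # Reject the first k - 1 candidates, then take the next one better than
    # everything seen so far; count how often that lands on rank 1 (the best).
    wins = 0
    perms = list(permutations(range(1, n + 1)))
    for perm in perms:
        best_seen = min(perm[:k - 1]) if k > 1 else float("inf")
        pick = None
        for candidate in perm[k - 1:]:
            if candidate < best_seen:  # a lower rank number means a better candidate
                pick = candidate
                break
        if pick == 1:
            wins += 1
    return wins / len(perms)

for n in range(3, 8):
    best_k = max(range(1, n + 1), key=lambda k: success_rate(n, k))
    print("n =", n, "best k =", best_k, "P(success) =", round(success_rate(n, best_k), 3))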

You could write out the 24 permutations for n = 4, the 120 for n = 5, and so on, finding the optimal kth choice for each number of candidates, but it would quickly become hard to do manually.  Let x be the position at which the best candidate appears.  An equation for the probability of success, P(S(n,k)) (success meaning you picked the best candidate), could be:

P(S(n,k)) = 1/n                                               when k = 1
P(S(n,k)) = ((k - 1)/n) * (sum from x = k to n of 1/(x - 1))  when k = 2, 3, ..., n
Reading through that: the probability of success is 1/n when you pick the first candidate (k = 1).  When you pick any candidate from the second to the last, the probability of success is (k - 1)/n times the sum of 1/(x - 1) from x = k to n.  You can test this equation against the case when n = 3; you will get the same probabilities that you manually calculated before.  A proof of how it is derived can be found here.

Now the probability of success, P(S(n,k)), has to be optimized by determining what happens when the number of available candidates, n, gets infinitely large.  This requires setting the derivative of the bottom part of the piecewise function above equal to zero to find critical values, and finding the critical value that is the global maximum for the function.  At first it might seem intimidating to take the derivative of that bottom chunk, but it becomes easier when you notice that it can be approximated by a definite integral, substituting t for (k - 1)/n and u for (x - 1)/n:
P(S(n,k)) ≈ t * (integral from t to 1 of (1/u) du)
Integration rules give:

t * (ln(1) - ln(t)) = -t * ln(t)

The derivative of –t * ln(t) with respect to t is:  –ln(t) - 1.  Setting this equal to zero and solving for t gives:

-ln(t)-1 = 0
-ln(t) = 1
ln(t) = -1
t = 1/e

Plugging t = 1/e back into -t * ln(t) gives -(1/e) * ln(1/e) = 1/e ≈ 0.368, and since the second derivative, -1/t, is negative there, this critical value is a maximum.  This means that when you have a large number of candidates, your odds of picking the best one are the highest when you let 1/e (or about 36.8%) of the candidates go by without picking any, and then pick the next candidate who is better than all of the others seen so far.  When you do this, your odds of picking the best one are 36.8%, which may seem low, but it is the highest percentage that you can hope to get.

In the real world, you probably never know the true number of candidates, n.  But you can mitigate your lack of such knowledge by setting restrictions.  For example, if you set the number of people you’re willing to date throughout your life to 200, then you can follow the 36.8% heuristic to find the best mate out of the 200.  It is interesting to think that letting 36.8% of your options go by is like establishing a baseline.  By that point, you will have a good idea of what is out there, and will then be in a position to make a fair judgment about what option is worth choosing.

Note that there are variations of the secretary problem that change the outcome.  For example, you could have group interviews, or the candidate you choose could reject your offer.  You can read more about those scenarios here.

Exploring Optimal Stopping Theory Computationally

I wrote a Python script to test the optimal stopping theory.  You can fork my GitHub repo for it.  The script works by generating a list of 1,000 random numbers between 1 and 100.  It loops through the list to store the first 36.8% of the numbers as a baseline, just as optimal stopping theory recommends, and then picks the next number that is higher than all the others before it.  Because the seed is set to allow reproducibility, no matter how many times you run the script, you’ll get the same set of random numbers and therefore the same result.  In my test, the best value, 100, was encountered in the first 36.8% of the observations, and no option following it was better; even though there could have been other 100s after the first one, no later number could ever exceed it.  You can change the seed to any other number and try it again.  It doesn’t matter what number you pick for the seed; it’s just used for repeatability.
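The core logic of the script looks roughly like this (a simplified sketch, not the repo’s exact code):

import random

random.seed(10)  # any seed works; it is fixed only for repeatability
numbers = [random.randint(1, 100) for _ in range(1000)]

cutoff = int(len(numbers) * 0.368)  # reject the first 36.8% outright
baseline = max(numbers[:cutoff])    # the best option seen during the rejection phase

choice = numbers[-1]  # settle for the last option if nothing ever beats the baseline
for n in numbers[cutoff:]:
    if n > baseline:  # take the first option better than everything seen so far
        choice = n
        break

print("baseline:", baseline, "choice:", choice, "true best:", max(numbers))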
For a seed of 10, the random numbers are plotted below.

[Plot: the 1,000 random numbers for a seed of 10, with rank order running from the bottom to the top of the y-axis.]

As you can see, if the numbers are ranked in order of 1 to 1,000, you’ll run across very high numbers rather quickly.  So the result makes sense.  After all, if optimal stopping theory holds, you would expect to end up with the very best number only about 36.8% of the time.

In this blog, I explained the theory of optimal stopping.  I proved why you maximize your odds of picking the best option out of a set of n total options if you let 36.8% of them go by without choosing, and then pick the next option that is better than all of the others you’ve seen until then.  Finally, I simulated a version of the secretary problem with Python, and showed that most of the time you probably won’t pick the best option.  So there is a little math that you can use to go out into the world and maximize your odds of making the best choices in life.

Saturday, November 15, 2014

How to Win at Rock-Paper-Scissors

What if I said that I could teach you how to win at rock-paper-scissors every single time?  No longer will you have to sit in the back seat, eat the smaller piece of cake, or be player two in a video game.  Does that sound too good to be true?  Maybe…but what if I said I could teach you how to win more times than your opponent?  That’s all you really need, and it seems more believable, right?  Well that’s exactly what I’m going to do!

When I was a young boy, I discovered that I could usually win at rock-paper-scissors if I played the hand that my opponent had just played and lost with.  The only requirement was that I win one round, and then I could go on winning indefinitely.  For some reason, most players who lost would choose the hand that would have beaten the hand that won the last round.  If they did not, they chose the third option, but they rarely ever replayed the hand they had just lost with.  They did not even realize they were doing it.

In psychology, this is called loss aversion.  A loss hurts twice as much as a win feels good.  Players who had just lost with paper, for example, would choose anything but paper in the next round.  So grabbing a win and then going with the hand that your opponent loses with is one strategy.  But how does that play out in terms of your odds of winning?

At first, your odds of winning are 1 in 3.  If players continued picking hands at random, your odds of winning would never change; in the long run, both of you would win approximately 33% of the time.  That is the Nash equilibrium for this game.  But people don’t behave randomly.  If you assume that my previous observation is correct, and that losers will pick the hand that would have beaten the hand that won the last round, then your odds of winning go up to 1 in 2: either they pick the hand that would have won, or they pick the other option.  They’ll rarely pick the same losing hand.

If you lose, it’s better to play the part of the loser.  In other words, pick the hand that would have won the last round.  Winners typically stick with their winning choice, much like gamblers.  In terms of psychology, winning serves as positive reinforcement for choosing that hand.  So if you pick the hand that would have won, and assume that winners will not pick the hand that just lost due to loss aversion, then your odds of winning the next round increase to 1 in 2.

In summary, if your opponent is loss averse, then your best strategies are as follows (a simulation sketch follows the list):

1.    If you win, pick the hand that your opponent lost with for the next round.
2.    If you lose, pick the hand that would have beaten your opponent for the next round.
3.    If you tie, just pick randomly, because your odds of winning are still 1 in 3.
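To see how these rules play out over many rounds, here is a small Python simulation against a loss-averse opponent.  The opponent model is my own assumption (never repeat the hand that just lost; otherwise pick between the other two at random), so treat the exact rates as illustrative:

import random

HANDS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
COUNTER = {loser: winner for winner, loser in BEATS.items()}        # hand that beats a given hand

def simulate(rounds=100000, seed=1):
    random.seed(seed)
    wins = losses = ties = 0
    me, opp = random.choice(HANDS), random.choice(HANDS)
    for _ in range(rounds):
        if BEATS[me] == opp:                # I won this round
            wins += 1
            losing_hand = opp
            my_next = opp                   # rule 1: play their losing hand
        elif BEATS[opp] == me:              # I lost this round
            losses += 1
            losing_hand = me
            my_next = COUNTER[opp]          # rule 2: play what would have won
        else:                               # tie
            ties += 1
            losing_hand = None
            my_next = random.choice(HANDS)  # rule 3: go random
        # Assumed opponent model: never repeat the hand that just lost;
        # otherwise pick between the other two hands at random.
        if losing_hand is None:
            opp_next = random.choice(HANDS)
        else:
            opp_next = random.choice([h for h in HANDS if h != losing_hand])
        me, opp = my_next, opp_next
    print("win rate:", wins / rounds, "loss rate:", losses / rounds, "tie rate:", ties / rounds)

simulate()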

BBC featured a story in May about rock-paper-scissors, and cited a Chinese study that corroborates what I’ve just written.  The BBC story states that winners stay with their winning hand about 33% of the time, and losers shift to the next hand in the rock-paper-scissors sequence (so they would go from rock to paper to scissors).  But contrary to the BBC article, it’s not that losers are picking the next hand in the sequence, it’s that they’re picking the hand that would have won the last round.  It just so happens that if you lose with rock, then the hand that would have won is scissors, and scissors is also the next hand in the sequence.  The same is true for losing with scissors and paper.

Of course, all of the strategies I’ve listed above are speculation.  They rely on the assumption that your opponent is unaware of these same ideas, and is just playing the hand that they feel could win.  If your opponent is aware of these ideas, it could change the game.  Or worse yet, your opponent could know about game theory and pick hands at random every single time, leaving you nothing to exploit.  If that is the case, then your best bet is to choose hands at random as well.  But no matter the scenario, I’ve shown how you can improve your odds of beating your opponent.  Now you can venture out into the world and win at rock-paper-scissors, and by doing so, you can also win at life.

Saturday, November 8, 2014

The Titanic Would Not Have Survived a Head on Collision with the Iceberg

One thing I’ve frequently heard about the Titanic that drives me crazy is the claim that if it had hit the iceberg head on, instead of at an angle, it would not have sunk.  That’s just absurd, and here is the math to prove it…

According to Google, the Titanic weighed 52,310 US tons.  Weight is technically a force (mass times the acceleration due to gravity), but for a rough estimate, converting US tons to kilograms gives the ship’s mass:

52,310 tons * 907.18474 kilograms per ton = 47,454,833.7494 kg

The Titanic was traveling at about 22 knots when it struck the iceberg, or 11.3178 m/s.  For simplicity, assume that it came to a dead stop upon impact, and that the stop took 1 second.  That makes the deceleration numerically equal to the velocity: 11.3178 m/s².  Since force = mass * acceleration, the force exerted by the Titanic if it rammed the iceberg would have been:

Force = 47,454,833.7494 kg * 11.3178 m/s² = 537,084,317.41 N

According to Newton’s third law, every force has an equal and opposite reaction, so the impact would have exerted just as much force back on the Titanic.  The Titanic’s hull was made of carbon steel, which has a shear strength of about 74,000 psi.  To compare the collision force against that figure, suppose the force were concentrated on a single square inch of the bow plating.  Since there are 1,550 square inches in 1 square meter, a force of F newtons on one square inch corresponds to a pressure of F * 1,550 N/m²:

537,084,317.41 N * 1,550 = 832,480,692,000 N/m²

Since 1 N/m² ≈ 0.00015 psi, the stress in psi would have been:

832,480,692,000 * 0.00015 = 124,872,103.8 psi
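If you want to check the arithmetic, here it is as a few lines of Python, using the same rounded conversion factors as above:

# Back-of-the-envelope numbers for the head-on collision estimate
TONS_TO_KG = 907.18474   # US short tons to kilograms
KNOTS_TO_MS = 0.514444   # knots to meters per second
SQIN_PER_SQM = 1550      # square inches per square meter
NSQM_TO_PSI = 0.00015    # N/m^2 to psi (rounded, as above)

mass_kg = 52310 * TONS_TO_KG  # about 47.45 million kg
accel = 22 * KNOTS_TO_MS      # about 11.32 m/s^2, assuming a 1-second stop
force_n = mass_kg * accel     # about 537 million N

# Stress if the entire force landed on a single square inch of hull plating
stress_psi = force_n * SQIN_PER_SQM * NSQM_TO_PSI

print(round(force_n), "N ->", round(stress_psi), "psi, versus 74,000 psi shear strength")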

That is far more than enough stress to break the carbon steel hull of the Titanic.  In fact, it would have been enough to shred it like paper.  The impact would have caused the steel to crumple and split into pieces, sending ripples down the entire length of the ship.  Everything inside would have been violently thrown forward, like in a car crash without seatbelts.  The ship would have busted apart at the seams and sunk almost immediately.  Nobody would have survived.

For anyone who still harbors any doubt, consider the following…

Would the Titanic have broken the iceberg in half?

If the iceberg was anywhere from 200-400 feet long, as some sources suggest, it could have been at least as thick.  Trying to cut it in half with a ship would be like trying to cut a brick of ice apart with a butter knife.  It’s just not going to happen.  Furthermore, only a small fraction of an iceberg floats above the surface of the ocean.  If the iceberg that the Titanic hit was 100 feet high, as sources suggest, it could have extended many hundreds of feet below the surface.  Again, there is no way a ship so much smaller than the berg would have stood a chance at splitting it apart.

Would the Titanic have pushed the iceberg out of the way?  

Momentum must be conserved in a collision, even one that crumples steel and dissipates energy.  Assume, generously, that the ship came to a dead stop and transferred all of its momentum to the iceberg, and take the iceberg’s mass to be 150,000,000 kg.  The iceberg’s new velocity would have been:

m_t * v_t1 + m_i * v_i1 = m_t * v_t2 + m_i * v_i2    (t = Titanic, i = iceberg; 1 = before impact, 2 = after)

( 47,454,833.7494 kg * 11.3178 m/s ) + ( 150,000,000 kg * 0 m/s ) = ( 47,454,833.7494 kg * 0 m/s ) + ( 150,000,000 kg * x m/s )

537,084,317.4 kg * m/s = 150,000,000 kg * x m/s

x = 3.5806 m/s

So after the Titanic hit, the iceberg could theoretically have moved at 3.5806 m/s, or about 8 mph, slowing quickly due to the drag of the ocean water.  However, since the majority of an iceberg lies beneath the water, the Titanic would have struck the top part of it.  Like a bobblehead doll, it more than likely would have wobbled rather than moved.  For an object of that size, any movement would have been barely noticeable.  So there is no way the Titanic could have pushed it out of the way.

Saturday, September 13, 2014

A BI Analyst’s Review of OBIEE (Oracle Business Intelligence Suite)

Oracle Business Intelligence Suite Enterprise Edition (OBIEE) is Oracle’s premier business intelligence platform.  I wrote a review for it on TechnologyAdvice.com that I decided to share in this blog.  So when you’re done reading, head on over to TechnologyAdvice.com to read or write reviews for other BI platforms. 

A Comprehensive Business Intelligence Suite
Overall Rating:  4/5

I am a business intelligence analyst who has used Oracle Business Intelligence Enterprise Edition (OBIEE) and Oracle Enterprise Performance Management (Hyperion EPM) for BI reporting. As a power user, I cannot speak to the cost or ease of system integration of Oracle Business Intelligence, but I can attest that its features are leaps and bounds beyond its predecessor, and it is easy to use once you learn how.

OBIEE is built upon a repository with 3 layers of metadata. The physical layer (bottom layer) holds the data, primary and foreign keys, and connections. The business model mapping layer (middle layer) is where the business logic, aggregation and structuring occurs. The presentation layer (top layer) is where data is arranged and presented to the end users. The presentation layer is also where security roles are assigned. Since most users will only see data from this presentation layer, it helps standardize organizational reporting so that bad queries do not result in different teams of people ending up with different data. 

Layering allows OBIEE to connect to multiple data sources and combine them all in the presentation layer. For example, you can build a report with connections to a relational table and an Essbase cube. You read that right: you could build a report using a seamlessly integrated combination of SQL and MDX, without knowing how to write either one!

OBIEE allows alerts and agents to be set up for scheduled processes. Agents are like advanced alerts that can trigger other processes depending on whether or not certain conditions are met. The automation that this feature allows is tremendous.

Mobile Application Developer is another product in the suite that allows users to build HTML5 applications that can be viewed on any mobile device. The interface reminds me of Microsoft PowerPoint, and you simply drag and drop charts, tables, filters, etc. onto the screen. Once the app is built, you can scan a QR code to view it on your phone. Since the apps are built with HTML, they can also be viewed in a web browser.

Overall, I would rate Oracle Business Intelligence as an excellent product.