Continuing with the Good Judgment Project mentioned in the previous post, I made some predictions about the Taiwan election (before election day, of course). Frankly, the outcome of the presidential race was never uncertain enough to make things interesting. Fortunately, the GJP assignment was to predict the seat distribution of the Legislative Yuan (LY), which was expected to be a tighter race than the presidency.
A short summary: I got lucky with my rough analysis. I predicted 65 seats for the DPP, 35 for the KMT, and 13 for other parties; the actual results were 68 for the DPP, 35 for the KMT, and 10 for others. My prediction was very close.
Numerous polls conducted by media outlets and polling companies showed that over the past one to two years, the electorate has been deeply dissatisfied with the ruling KMT government. The numbers have been remarkably stable, with dissatisfaction above 50% and satisfaction below 30%. While this does not bode well for the Nationalists, I realized that such stability could make a prediction more accurate.
My idea was to study the patterns of Taiwan's elections over the past couple of years and apply them to the 2016 election.
I had two choices: the 2012 LY election or the 2014 regional elections. At first glance the 2012 LY election may seem the obvious choice, since we are predicting another LY election in 2016; however, the regional elections had the advantage of being much closer to the 2016 election date, which minimizes variation, especially since the strategy is to extrapolate old results onto the new race. The general atmosphere of the 2014 election was also much more similar to 2016 than that of 2012: it took place after the Sunflower Movement and the occupation of the Legislative Yuan, an event that caused the government's approval rating to tumble. The approval rating has been stable since then. For these reasons, I decided to analyze the 2014 election.
Opinion poll data are available on Wikipedia under the page "2014年中華民國直轄市長選舉及縣市長選舉" (the 2014 Republic of China municipal and county elections). In summary, the DPP performed significantly better than pundits expected and won in a landslide. Going through each county and city and calculating the average difference between the final polls and the election results, I found that the results were a whopping 18% higher on average than the polls suggested. Furthermore, in races where the results were close (a margin of less than 10%), the DPP performed 23% better. This wave of unexpected support let the DPP snatch a good number of seats in traditional KMT strongholds.
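For concreteness, here is the shape of that comparison as a minimal Python sketch; the city names and percentages below are placeholders, not the actual 2014 county-by-county data.

```python
# Minimal sketch of the poll-vs-result comparison (placeholder numbers,
# not the actual 2014 county-by-county data).
final_poll = {"City A": 0.42, "City B": 0.38, "City C": 0.47}  # DPP share in the last polls
result = {"City A": 0.57, "City B": 0.60, "City C": 0.63}      # DPP share on election day

gaps = [result[c] - final_poll[c] for c in final_poll]
print(f"average DPP over-performance: {sum(gaps) / len(gaps):.1%}")

# Restricting the same calculation to races with a poll margin under 10%
# gives the second statistic quoted above.
```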
So how to explain such a large difference? If we assume the polls were accurate and no committed voter changed his or her mind, then roughly two-thirds of the voters who were undecided in the polls (on average about 30% of respondents) were won over by the DPP in the final stretch of the campaign. Since studies consistently show that a sizable share of voters make up their minds only toward the end of a campaign, another last-minute surge would not be a surprise, especially given how similar the general mood is between these two elections.
The final aggregate numbers from the polls just before the blackout period (roughly a week before the election) put DPP support at 31%, the KMT at 18%, and the undecided at 24%. Assuming two-thirds of undecided voters go with the DPP, actual DPP support could exceed 45%. This is a little more than the KMT's support in the 2012 Legislative Yuan election, in which it won 64 seats.
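As a sanity check, the arithmetic is simple enough to write down directly; this is just the extrapolation described above, not a new model.

```python
# The undecided-voter extrapolation described above: pre-blackout aggregates
# plus the assumption that two-thirds of the undecided break for the DPP
# (how the remaining third splits is left open).
dpp_poll, undecided = 0.31, 0.24
dpp_projected = dpp_poll + (2 / 3) * undecided
print(f"projected DPP support: {dpp_projected:.0%}")  # about 47%, i.e. "over 45%"
```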
The LY consists of at-large members elected by proportional party vote and district members elected by first-past-the-post. Of the 64 seats the KMT won in 2012, only 16 came from the party list; the rest came from local districts. I am not familiar enough with Taiwan to do a district-by-district analysis; however, past elections show that the KMT's support is more evenly spread than the DPP's, whose support tends to be clustered in certain regions. The implication is that the DPP needs a larger share of the popular vote to win the same number of seats as the KMT.
However, I expected this disadvantage to be offset by the strong DPP surge in close districts observed in 2014. It is difficult to quantify this effect without a district-by-district analysis; I roughly estimated that it would even out the disadvantage of the DPP's more clustered support base.
In the end, based on the extrapolated popular support, the DPP should win roughly the same number of seats as the KMT did in 2012, about 65. KMT support this year is even lower than the DPP's in 2012, when the DPP won 40 seats, so I pegged the KMT at 35 seats (a lucky guess!). People First Party support looks stagnant, so no major seat change was expected there. According to the polling data, the support bleeding away from the KMT should be picked up by the newly formed New Power Party, which should gain around 5 seats.
The final results were very close to my prediction: the DPP won 68 seats, the KMT 35, the NPP 5, and the PFP 3. The result supports my presupposition that the popular mood remained the same as in 2014, and that the same mood among the electorate produces the same result. The prediction would have been considerably harder had the pollsters detected a major shift in opinion.
Apple iPhone Forecasting
I figured that it would be good for me, career-wise, to polish my numerical analysis skills. I had valuable experience doing physics labs back in undergrad: apart from the fact that the experiments involved cool physics (shining lasers, ultra-sensitive particle detectors, etc.), they also trained me to approach data analysis mathematically and systematically.
However, I have been out of practice for some time and was looking for ways to keep my skills sharp. I stumbled across Good Judgment Open (www.gjopen.com), an open site that lets anyone make forecasts on an array of real-life questions. It was started a number of years ago by a Wharton professor to study what kind of person makes a good forecaster, and it has become quite popular in the forecasting community. I thought this would be a perfect way for me to practice some data analysis.
The first challenge I took on was predicting whether iPhone sales in Q4 of 2015 (the calendar quarter ending in December) would exceed 75 million units. The strategy was to use simple extrapolation to arrive at a number. This quarter is the first full quarter after the release of the iPhone 6S and 6S Plus on September 25, 2015, roughly a year after the release of the iPhone 6. In its own first full quarter, the iPhone 6 sold a peak of 74.5 million units; however, "S" versions tend to sell less than full redesigns.
I first approached the problem as a pure curve-fitting exercise. Over the past four years (2011-2015), iPhones have been released at around the same time each year, forming a roughly periodic data set with 16 points, one for each quarter.
From the looks of it, Apple should easily clear the 75 million mark in the first quarter. But the purpose of these exercises is to practice my numerical analysis skills, so I should crunch some numbers first.
Next I fitted a quadratic model to the data: it seemed reasonable that the rate of change of sales would decrease linearly over time. I averaged the quadratic coefficients of the previous four years to find typical values; that is, I took the first- and second-order coefficients of each year's fit, assuming these higher-order terms would be more stable across years than the constant. To find the constant term, I calculated the annual average sales of the previous years and performed a least-squares regression, which projected an expected average of 72 million units per quarter; in other words, the average quarterly sales for the whole year are already approaching the 75 million threshold. From that annual average I could solve for the constant term. Applying the resulting model to this past quarter yields 83 million units, well above the threshold. However, when I computed the confidence intervals, the variance on my fitted coefficients was so large that no conclusion could be justified with certainty (pretty much anything from zero to twice the predicted value fell within the interval).
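For reference, here is a minimal sketch of this averaged-quadratic procedure. The quarterly figures are illustrative placeholders (only loosely shaped like reported iPhone numbers), and numpy's polyfit stands in for whatever fitting routine was actually used.

```python
import numpy as np

# Illustrative quarterly unit sales (millions) for four release cycles;
# placeholder numbers, not Apple's actual figures.
years = np.array([
    [37.0, 35.1, 26.0, 26.9],
    [47.8, 37.4, 26.0, 31.2],
    [51.0, 43.7, 33.8, 35.2],
    [74.5, 61.2, 47.5, 48.0],
])
q = np.arange(4)  # quarter index within each cycle: 0..3

# Fit a quadratic to each cycle and average the 2nd- and 1st-order
# coefficients, assuming the curve *shape* is stable across cycles
# even though the overall level keeps rising.
coeffs = np.array([np.polyfit(q, y, 2) for y in years])
a_mean, b_mean = coeffs[:, 0].mean(), coeffs[:, 1].mean()

# Project the new cycle's average quarterly sales with a linear trend over
# the annual means, then back out the constant term c so that the quadratic's
# mean over q = 0..3 matches that projection.
annual_means = years.mean(axis=1)
trend = np.polyfit(np.arange(len(annual_means)), annual_means, 1)
next_mean = np.polyval(trend, len(annual_means))
c = next_mean - (a_mean * (q**2).mean() + b_mean * q.mean())

launch_quarter = c  # the model evaluated at q = 0, the launch quarter
print(f"projected launch-quarter sales: {launch_quarter:.0f} M")
```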
I then fitted an exponential decay model in the hope of getting tighter uncertainties, fitting the previous years and extrapolating to this past quarter using similar methods. An exponential model requires assuming an equilibrium value that sales approach once the transients have decayed away. I tried two choices for this equilibrium: the average quarterly sales of the previous year, and the minimum quarterly sales of the previous year. Both exploit the fact that in the previous four years, a new iPhone has never sold more slowly than its predecessor in any quarter. I averaged the decay time constants across years and assumed the average would hold for the new year; after a linear extrapolation of the expected average in the new quarter, the remaining coefficients can be computed. The predicted sales for the first quarter came out to 60 million and 50 million units respectively, again with very large uncertainties.
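Below is a sketch of one such decay fit, again with placeholder numbers; scipy's curve_fit is used here, and the equilibrium is pinned to the previous year's minimum quarter (the second of the two choices above).

```python
import numpy as np
from scipy.optimize import curve_fit

# Each cycle is modelled as an exponential decay toward an assumed
# equilibrium level y_inf. Data below are illustrative placeholders.
def decay(q, amplitude, rate, y_inf):
    return y_inf + amplitude * np.exp(-rate * q)

q = np.arange(4, dtype=float)
last_cycle = np.array([74.5, 61.2, 47.5, 48.0])   # illustrative quarterly sales
y_inf = last_cycle.min()                          # one of the two equilibrium choices

# Fit only the amplitude and decay rate, holding y_inf fixed.
popt, pcov = curve_fit(lambda q, A, r: decay(q, A, r, y_inf),
                       q, last_cycle, p0=(30.0, 0.5))
A_fit, rate_fit = popt
rate_err = np.sqrt(np.diag(pcov))[1]
print(f"decay rate: {rate_fit:.2f} ± {rate_err:.2f} per quarter")
```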
So far the analysis is inconclusive. The 75 million threshold sits roughly in the middle of the two models' predictions, so naively one might put the probability of hitting the target at around 50%. The data, however, are simply too sparse to conclude anything with confidence.
Nevertheless, some interesting properties can still be extracted, particularly from the decay model. Looking at the decay rates, the product with the least sustained popularity was the iPhone 5, released in late 2012, with a decay rate of 1.26 per quarter. It is followed by the iPhone 5C and 5S, released the following year, at 1.00 per quarter, and then the iPhone 6 at 0.65 per quarter. The product with the most sustained popularity was the iPhone 4S, released in late 2011, with a decay rate of just 0.25 per quarter. This is only one side of the picture, however, since overall sales have been growing steadily year over year during this period. Taking that into account, the iPhone 6 can be considered quite a success: despite a large spike at launch, it sustained its popularity into the second quarter. It will be interesting to see the statistics for the 6S.
Finally, I looked at market share. Apple's share of the global mobile phone market has increased over the past several years, even as global mobile phone sales have stabilized at around 450 million units per quarter.
Extrapolating the linear fit (which, compared with my previous models, carries a much smaller uncertainty), the quarter in question should see about 83 million units sold, with a fairly tight confidence interval. That is 8 million above the 75 million threshold, so this calculation gives a more optimistic result than my previous analyses.
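The market-share extrapolation amounts to a single linear fit; a sketch with placeholder share figures is shown below.

```python
import numpy as np

# Linear-trend extrapolation of Apple's market share (placeholder share
# figures; the ~450M-unit quarterly market size is from the discussion above).
quarters = np.arange(8)                                   # recent quarters, oldest first
share = np.array([0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.155, 0.165])
market_size = 450.0                                       # million handsets per quarter

slope, intercept = np.polyfit(quarters, share, 1)
next_share = slope * len(quarters) + intercept            # extrapolate one quarter ahead
print(f"projected iPhone sales: {next_share * market_size:.0f} M units")
```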
So the answer is less certain than I first thought when I saw the initial graph of promising year-over-year growth. Which model should I go with? Should I simply average them, or weight them by their variances? The last analysis tipped the scales for me because of its small uncertainty. In the end, I decided to play it safe and go with 60% confidence (within one standard deviation) that the iPhone 6S will exceed 75 million unit sales in its first quarter.
MATLAB vs Python: finite difference
We were doing a homework assignment with coding components, part of which involved taking derivatives. In an engineering lab, the go-to language is often MATLAB. Unaware of the much more straightforward diff function that my classmate used, I constructed a finite difference matrix using spdiags. As a result, my code was several lines longer, since I needed to create a few matrices and arrays as inputs to spdiags, and it ended with a matrix multiplication instead of a straightforward array manipulation. Out of curiosity, I timed both pieces of code on my laptop. To my amazement, the matrix formulation won easily; I ran each version multiple times to account for random machine variance, and not once did the other version execute faster. Now my curiosity was really piqued.
Anyone who has used MATLAB knows how slow it can be. This, among other reasons, makes people rightly conclude that it is not a "real" programming language. MathWorks never claimed it to be a top choice for application programmers, but it does sell MATLAB as a scientific computing language; in particular, its matrix algorithms are said to be a big selling point. But what are the hard numbers behind this claim? They were not readily available (although, digging a little deeper, some material can be found online), so I decided to run a few tests of my own.
My tests compared four finite difference methods in the two languages: the built-in difference function, a finite difference matrix, simple convolution, and Gaussian convolution. I had never used the last two methods before, but they appear to be common in the image processing community because they are easy to implement. A quick explanation: the differencing in a finite difference can be seen as a convolution of the original signal with the kernel [1, -1] (the simple convolution method), while convolving with a Gaussian window additionally applies Gaussian smoothing to the signal, giving smoother curves. I coded the methods in MATLAB and in Python, a language I had recently been learning and found to be quick and nimble. I ran each piece of code a dozen times or so, threw out the outliers, and took the mean.
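To make the comparison concrete, here is a Python sketch of the four methods (the MATLAB versions used diff, spdiags, and conv analogously); the test signal, array size, and Gaussian sigma are arbitrary choices for illustration.

```python
import timeit
import numpy as np
from scipy import sparse
from scipy.ndimage import gaussian_filter1d

n = 1_000_000
x = np.linspace(0, 10, n)
y = np.sin(x)          # arbitrary test signal
h = x[1] - x[0]

# 1. built-in difference function
d_diff = np.diff(y) / h

# 2. sparse finite-difference matrix: each row applies [-1, 1] / h
D = sparse.spdiags([-np.ones(n), np.ones(n)], [0, 1], n - 1, n) / h
d_matrix = D @ y

# 3. simple convolution with the kernel [1, -1]
d_conv = np.convolve(y, [1, -1], mode="valid") / h

# 4. Gaussian smoothing followed by differencing
d_gauss = np.diff(gaussian_filter1d(y, sigma=3)) / h

# Rough timing of each method (mean over repeated runs)
for name, stmt in [("diff", "np.diff(y)"),
                   ("matrix", "D @ y"),
                   ("conv", "np.convolve(y, [1, -1], mode='valid')"),
                   ("gauss", "np.diff(gaussian_filter1d(y, sigma=3))")]:
    t = timeit.timeit(stmt, number=20, globals=globals()) / 20
    print(f"{name:>6}: {t * 1e3:.2f} ms")
```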
The Python distribution tested is the popular Enthought Canopy. This makes the comparison particularly meaningful, because Canopy is also marketed as a scientific computing platform and generally gets good reviews.
The following table tells an interesting story:
The units are milliseconds. MATLAB is easily beaten on most of the benchmarks here. The only place MATLAB wins is its matrix method: as mentioned earlier, its optimized matrix algorithms allow for faster computation, although setting up the formulation takes some effort. After using spdiags a couple of times, though, I find it becomes straightforward; I suppose the difficulty just makes those who know how to do it that much more valuable. Matrix operations are computationally intensive, so for a less optimized language like Python it is not surprising to see it struggle most there.
MATLAB does not have special functions for Gaussian filters. MATLAB implementations of Gaussian convolution, therefore, would use the same conv function as the simple convolution method, guaranteeing an execution time no faster than simple convolution.
If one is just looking to write pure numerical code, Python seems a good choice, unless a heavy amount of linear algebra is involved. Many users, however, also need other functionality. MATLAB's plethora of toolboxes is nice, but with some work one can usually find Python modules with similar functions. For my homework, I needed to present my computed results in graphs. Here is the result when I timed the entire code for the assignment:
This confirms something I intuitively knew: MATLAB plots always seemed to pop up just a tad slower than Python plots, and the tests now make that evident.
These tests illustrate both the strengths and weaknesses of MATLAB. It is great for hard-core linear algebra work and, to a lesser extent, general numerical work, but as soon as things get fancy, be it adding a GUI or producing pretty pictures, it may be better to look elsewhere.
Research Expenditure Data
Research expenditure is an interesting research benchmark. Few ranking systems take it into account, possibly because the data are not always available worldwide. However, anyone with some research experience, especially on the experimental side, knows the importance of a well-funded lab. In a scientific community increasingly embracing "Big Science", this benchmark will only grow in importance.
As mentioned above, research expenditure data are not always easy to find. With some digging, however, I was able to obtain engineering research expenditure data from the NSF for the US and from a few universities in Canada. The result is a fairly interesting graph:
Initially it was a surprise to see Johns Hopkins way out ahead at the top. On reflection, perhaps it is understandable: life sciences and medical research generally require the most funding, and since JHU focuses heavily on this area, its engineering research is also concentrated in the biomedical field. But I don't think this is the whole story.
If we consider government contributions alone, the top five are listed in the following table (dollar figures are in thousands):
Not only does JHU receive more than second and third place combined, amounting to 6% of all government funding, it is also the only private university in the top five.
I was puzzled by this until I read the fine print of the NSF data: the figures include APL funding. Affiliated with JHU for historical reasons, the Applied Physics Laboratory (APL) is essentially a government defense contractor with close links to the DoD and NASA. This would explain not only JHU but also surprising entrants like CNSE at SUNY Albany. These institutions may not feature well in ranking systems that measure research excellence by publication and citation counts, because a fair number of their projects are classified. I would not be surprised, however, if there is a great deal of world-class expertise hidden in these institutions. It would be interesting to plug research expenditure into one of the existing ranking algorithms as a metric and see how the rankings change.
For Canada, U of T, UBC, and Waterloo receive the most NSERC funding and were thus the subject of my interest. U of T squeaks into the top 40, while UBC hovers around 70. Waterloo data were not readily available, but extrapolating from NSERC figures, it should fall somewhere between U of T and UBC, around the top-60 mark. Although somewhat underwhelming compared with universities in a country with a significantly larger GDP, the Canadian schools have been consistently improving year over year relative to US schools: Tri-Council funding has steadily increased while American government funding has stagnated or even decreased. American public schools have been particularly affected because their third-party contributions are more limited. The University of Toronto, for example, has boosted its funding to over 80 million in the most recent fiscal year, which would push it into the top 30, shoulder-to-shoulder with schools such as Utah State and SUNY Buffalo, and ahead of the likes of Princeton and Duke in 2013 terms.
Some technical notes:
This was my first mini-project done in Python with the pandas module. I gained some familiarity with it, but I find pandas narrowly focused on data manipulation. For convenience, it was faster to convert to the old familiar numpy arrays for the calculations and then convert back to a pandas DataFrame afterwards. This is clearly sub-optimal, and hopefully better solutions exist.
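A minimal sketch of that round trip, with made-up column names and figures:

```python
import numpy as np
import pandas as pd

# Hypothetical expenditure table; the columns and numbers are made up.
df = pd.DataFrame(
    {"federal": [1_000_000, 450_000, 400_000],
     "industry": [60_000, 90_000, 75_000]},
    index=["Univ A", "Univ B", "Univ C"],
)

values = df.values                                    # drop down to a plain numpy array
shares = values / values.sum(axis=1, keepdims=True)   # do the arithmetic in numpy

# ...then wrap the result back into a labelled DataFrame.
out = pd.DataFrame(shares, index=df.index,
                   columns=[f"{c}_share" for c in df.columns])
print(out)
```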