Archive for 'Big Data'

I have noticed that as people age, they become finer and finer versions of themselves. Their eccentricities become sharper and more pronounced; their opinions and ideas more pointed and immutable; their thoughts more focussed. In short, I like to say that they become more perfect versions of themselves. We see it in our friends and acquaintances and in our parents and grandparents. It seems a part of natural human development.

Back in 2006, Netflix launched the Netflix Prize, offering $1,000,000 to whoever could most improve the accuracy of its predictions about how much someone is going to enjoy a movie based on their movie preferences. Contestants were given access to a set of Netflix’s end-users’ movie ratings and were challenged to provide recommendations of other movies to watch that bested Netflix’s own recommendation engine. BellKor’s Pragmatic Chaos team was announced as the winner in 2009, having managed to improve Netflix’s recommendations by 10%, and walked off with the prize money.

What did they do? Basically, they algorithmically determined and identified movies that were exceptionally similar to the ones that were already liked by a specific user and offered those movies as recommended viewing. And they did it really well.
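
For the curious, here is a minimal sketch of the "find movies similar to the ones you already like" idea. It is not the BellKor team's method (their winning entry was a far more elaborate blend of models); it is plain item-to-item cosine similarity, and the ratings table, movie titles and function names are all made up for illustration.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {movie: stars}); not Netflix data.
RATINGS = {
    "ann":  {"Rom-Com A": 5, "Rom-Com B": 4, "Action X": 1},
    "bob":  {"Rom-Com A": 4, "Rom-Com B": 5, "Rom-Com C": 5},
    "cara": {"Action X": 5, "Action Y": 4, "Rom-Com A": 1},
    "dev":  {"Action X": 4, "Action Y": 5, "Rom-Com C": 2},
}


def item_vectors(ratings):
    """Invert user -> movie ratings into movie -> user ratings."""
    vectors = {}
    for user, movies in ratings.items():
        for movie, stars in movies.items():
            vectors.setdefault(movie, {})[user] = stars
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=2):
    """Rank unseen movies by similarity to what the user already rated,
    weighting each seen movie by the stars the user gave it."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate, vector in vectors.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vector, vectors[movie]) * stars
            for movie, stars in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("ann", RATINGS))
```

Ann rated the romantic comedies highly, so the unseen romantic comedy ranks ahead of the unseen action film: more of the same, only sharper.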

In essence, what the BellKor team did was build a better echo chamber. Every viewer is analyzed, their taste detailed, and then the algorithm perfects that taste and hones it to a razor-sharp edge. You become, say, an expert in light romantic comedies with a strong female lead who lives in a spacious apartment in Manhattan, many dog owners, no visible children and frequent panoramic views of Central Park.

Of course, therein lies the rub. A multifaceted rub at that. As recommendation engines become more accurate and discerning of individual tastes, they remove any element of chance, randomness or error that might serve to introduce new experiences, genres or even products into your life. You become a more perfect version of you. But in that perfection you are also stunted. You are shielded from experimentation and breadth of experience. You pick a single pond and overfish it.

There are many reasons why this is bad and we see it reflected, most obviously, in our political discourse where our interactions with opposing viewpoints are limited to exchanges of taunts (as opposed to conversations) followed by a quick retreat to the comfort of our well-constructed echo chambers of choice where our already perfected views are nurtured and reinforced.

But it also has other ramifications. If we come to know what people like to such a degree, then innovation outside safe and well-known boundaries might be discouraged. If Netflix knows that 90% of its subscribers like action/adventure films with a male hero and lots of explosions, why would they bother investing in a story about a broken family being held together by a sullen beekeeper? If retail recommendations hew toward what you are most likely to buy – how can markets of unrelated products be expanded? How can individual tastes be extended and deepened?

Extending that – why would anyone risk investment in or development of something new and radically different if the recommendation engine models cannot justify it? How can the leap be made from Zero to One – as Peter Thiel described – in a society, market or investment environment in which the recommendation data is not present and does not justify it?

There are a number of possible answers. One might be that “gut instincts” need to continue to play a role in innovation and development and investment and that risk aversion has no place in making the giant leaps that technology builds upon and needs in order to thrive.

A more geeky answer is that big data isn’t yet big enough and that recommendation engines aren’t yet smart enough. A good recommendation engine will not just reinforce your prejudicial tastes; it will also often challenge and extend them. We don’t yet have the modelling right to do that effectively. The data are there but we don’t yet know how to mine them correctly to broaden rather than narrow our horizons. This broadening – when properly implemented – will widen markets and opportunities and increase revenue.
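
One crude way to picture "broaden rather than narrow" is to reserve a slot or two in every recommendation list for something outside the viewer's comfort zone. The sketch below is only an illustration under that assumption; the function name, the titles and the genre labels are all hypothetical, and a real engine would need something far subtler than a fixed quota.

```python
def broaden(ranked, genre_of, familiar_genres, slots=3, explore_slots=1):
    """Fill most slots with the best-scoring familiar titles, but reserve
    `explore_slots` for the best-scoring titles from unfamiliar genres.

    `ranked` is a list of titles already sorted best-first by some
    recommender; `genre_of` maps each title to a genre label.
    """
    familiar = [m for m in ranked if genre_of[m] in familiar_genres]
    novel = [m for m in ranked if genre_of[m] not in familiar_genres]

    explore = novel[:explore_slots]
    picks = familiar[:slots - len(explore)] + explore

    # If either pool ran short, top up from whatever is left, best first.
    leftovers = [m for m in ranked if m not in picks]
    return picks + leftovers[:slots - len(picks)]


# Hypothetical ranking for a devoted rom-com viewer.
ranked = ["Rom-Com C", "Rom-Com D", "Rom-Com E", "Beekeeper Drama", "Action Y"]
genre_of = {
    "Rom-Com C": "rom-com",
    "Rom-Com D": "rom-com",
    "Rom-Com E": "rom-com",
    "Beekeeper Drama": "drama",
    "Action Y": "action",
}

print(broaden(ranked, genre_of, familiar_genres={"rom-com"}))
```

A pure best-first list would have served three romantic comedies; with one slot held back, the sullen beekeeper at least gets a showing.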


This is the second installment in my irregular series of book reviews for O’Reilly Media. In the interests of full disclosure, I received this ebook for free in exchange for this review. I get to keep it even if I hate it and they will publish this review on their web site even if I trash the volume completely.

The book under the microscope this time is “R Cookbook” by Paul Teetor. For those of you unfamiliar, R is a powerful, free, open source programming language and environment used for statistical programming and analysis. It features a rich graphical display language to assist in data visualization. You can think of it as a scripting language akin to Excel spreadsheets or a variant of MATLAB focused on statistics. The language includes a full suite of community-developed, sector-specific libraries that provide re-usable functions typical of industry needs. These libraries indicate the areas in which R has found popularity: the worlds of finance, genomics, statistics and data science.

The R Cookbook describes itself as a book for the user who is somewhat familiar with R but needs easy access to useful techniques and common R program building blocks. The book is arranged as a series of recipes. Each recipe describes a problem that you might be trying to solve and then one or more solutions to it. For instance, “You want a basic statistical summary of your data” is described as the problem to solve and then the text provides you with at least one approach to solving it.

The structure of the book is such that it begins with simpler recipes and builds its way up to more complex ones. In fact, because of this structure, I would recommend this book as a great tool for learning R for the novice, even though the book itself says that is not its intended use. The rationale behind this recommendation is that the beginning recipes are tasks like “How do I install R?” and similar novice tasks. It then builds slowly from there into use cases and scenarios of increasing complexity and utility.

The book includes great examples that illustrate the power of R in doing data transformation, probability and statistical analysis. It also shows how you can use R to provide meaningful graphical representations of your results. The chapter on ‘Useful Tricks’ is what seals the deal for me, providing 19 great pointers to help you improve your R analyses.


There is a huge focus on big data nowadays. Driven by ever-decreasing prices and ever-increasing capacity of data storage solutions, big data provides magical insights and new windows into the exploitation of the long tail and addressing micro markets and their needs. Big data can be used to build, test and validate models and ideas. Big data holds promise akin to a panacea. It is being pushed as a universal solution to all ills. But if you look carefully and analyze correctly, what big data ultimately provides is what Marshall McLuhan described as an accurate prediction of the present. Big data helps us understand how we got to where we are today. It tells us what people want or need or do within a framework as it exists today. It is bounded by today’s (and the past’s) possibilities and ideas.

But big data does not identify the next seismic innovation. It does not necessarily even identify how to modify the current big thing to make it incrementally better.

In the October 2013 issue of IEEE Spectrum, an article described the work of a company named Lex Machina. The company is a classic big data play. They collect, scan and analyze all legal proceedings associated with patent litigation and draw up statistics identifying, for instance, the companies who are more likely to settle, law firms that are more likely to win, judges who are more favorable to defendants or the prosecution, duration and cost assessments of prosecutions in different areas. So it is a useful tool. But all it does is tell you about the state of things now. It does not measure variables like outcomes of prosecution or settlements (for instance, if a company wins but goes out of business or wins and goes on to build a more dominant market share or wins and nothing happens). It does not indicate if companies protect only specific patents that have, say, an estimated future value of $X million, or what metric companies might use in their internal decision-making process, because that is likely not visible in the data.

Marissa Mayer, the hyper-analyzed and hyper-reported-on CEO of Yahoo!, famously tests all decisions based on data. Whether it is the shade of purple for the new Yahoo! logo, the purchase price of the next acquisition or the value of any specific employee – it’s all about measurables.

But how can you measure the immeasurable? If something truly revolutionary is developed, how can big data help you decide if it’s worth it? How even can little data help you? How can people know what they like until they have it? If I told you that I would provide you with a service that lets you broadcast your thoughts to anyone who cares to subscribe to them, you’d probably say, “Sounds stupid. Why would I do that and who would care what I think?” If I then told you that I forgot one important aspect of the idea, that every shared thought is limited to 140 characters, you would likely have said, “Well, now I KNOW it’s stupid!” Alas, I just described Twitter, an idea that turned into a company that is, as of this writing, trading on the NYSE for just over $42 per share with a market capitalization of about $25 billion.

Will a strong reliance on big data lead us incrementally into a big corner? Will all this fishing about in massive data sets for patterns and correlations merely reveal the complete works of Shakespeare in big enough data sets? Is Big Data just another variant of the Infinite Monkey Theorem? Will we get to the point that with so much data to analyze we merely prove whatever it is we are looking for?
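
That last worry is easy to demonstrate on a small scale. The sketch below is a toy illustration, not an analysis of any real data set: it generates one short random series, then goes fishing through a thousand other series of pure noise and reports the strongest correlation it can find. Every name and parameter in it is made up for the example.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def strongest_chance_correlation(n_series=1000, n_points=20, seed=0):
    """Search `n_series` unrelated noise series for the one that happens
    to correlate best with a random target, and return that |r|."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        noise = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, noise)))
    return best


print(f"Best correlation found in pure noise: {strongest_chance_correlation():.2f}")
```

Nothing in those thousand series has anything to do with the target, yet the best match will typically look like a respectable finding. Look in enough places and you will find what you are looking for.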

Already we are seeing that Google Flu Trends is looking for instances of the flu and finds them where they aren’t or in higher frequencies than they actually are.  In that manner, big data fails even to accurately predict the present.

It is only now that some of the issues with ‘big data’ are being considered. For instance, even when you have a lot of data – if it is bad or incomplete, you still have garbage, only a lot more of it (that is where wearable devices, cell phones and other sophisticated but thinly veiled data accumulation appliances come into play – to help improve the data quality by making it more complete). Then the data itself is only as good as the analysis you can execute on it. The failings of Google Flu Trends are often attributed to bad search terms in the analysis but, of course, there could be many other reasons.

Maybe, in the end, big data is just big hubris.  It lulls us into a false sense of security, promising knowledge and wisdom based on getting enough data but in the end all we learn is where we are right now and its predictive powers are, at best, based merely on what we want the future to be and, at worst, are non-existent.
