Monday, June 26, 2017

Data Science and Machine Learning Without Mathematics

There is a set of techniques covering all aspects of machine learning (the statistical engine behind data science) that does not use any mathematics or statistical theory beyond high school level. So when you hear that some serious mathematical knowledge is required to become a data scientist, this should be taken with a grain of salt.
Math is thought to be a requirement for the following reasons:
  • Standard tools such as logistic regression, decision trees or confidence intervals, are math-heavy
  • Most employers use standard tools
  • As a result, hiring managers are looking for candidates with a strong math background, mostly for historical reasons
  • Academic training for data scientists is math-heavy for historical reasons (it is typically taught by the professors who used to teach statistics classes)
Because of this, you really need to be math-savvy to get a "standard" job, so sticking to standard math-heavy training and standard tools works for people interested in becoming data scientists. To make things more complicated, most of the courses advertised as "math-free" or "learn data science in three days" are selling you snake oil: they won't help you get a job, and the training material is often laughable. You can learn data science very quickly, even on your own, if you are a self-learner with a strong background working with data and programming (maybe you have a physics background), but that is another story.
Yet there is a set of techniques, designed by a data scientist with a strong mathematical background and a long list of publications in top statistical journals, that uses neither mathematics nor statistical modeling. These techniques work just as well, and some of them have been proved to be equivalent to their math-heavy cousins, with the additional bonus of generally being more robust. They are easy to understand and lead to easy interpretations, yet they are not snake oil: they are based on years of experience processing large volumes of diverse data, mostly in automated mode.
If you create your own startup, develop your own data science consultancy, or work for an organization that does not care about the tools that you use -- as long as they are cheap, easy to implement, and reliable -- you might consider using these simple, scalable, math-free methods. For instance, if you develop algorithms for stock trading, you wouldn't want to use the same tools as your competitors. These math-free techniques can give you a competitive advantage.
Below, I describe several math-free techniques covering a good chunk of data science, and how they differ from their traditional math-heavy cousins. I use them pretty much every day, though most of the time in an automated way.

Sunday, June 25, 2017

Advanced Machine Learning with Basic Excel

In this article, I present a few modern techniques that have been used in various business contexts, comparing performance with traditional methods. The advanced techniques in question are math-free, innovative, efficiently process large amounts of unstructured data, and are robust and scalable. Implementations in Python, R, Julia and Perl are provided, but here we focus on an Excel version that does not even require any Excel macros, coding, plug-ins, or anything other than the most basic version of Excel. It is actually easily implemented in standard, basic SQL too, and we invite readers to work on an SQL version.

Who should use the spreadsheet?

First, the spreadsheet (as well as the Python, R, Perl and Julia versions) is free to use and modify in any context, even commercial, and even to make a product out of it and sell it. It is part of my concept of open patent, in which I share all my intellectual property publicly and for free.

The spreadsheet is designed as a tutorial, though it processes the same data set as the one used for the Python version. It is aimed at people who are not professional coders: people who manage data scientists, BI experts, MBA professionals, and people from other fields with an interest in understanding the mechanics of some state-of-the-art machine learning techniques, without having to spend months or years learning mathematics, programming, and computer science. A few hours are enough to understand the details. This spreadsheet can be the first step to help you transition to a new, more analytical career path, to better understand the data scientists that you manage or interact with, to spark a career in data science, or even to teach machine learning concepts to high school students.

The spreadsheet also features a traditional technique (linear regression) for comparison purposes.

Click here to read this article, download the spreadsheet, and start using it.

Friday, June 23, 2017

12 Interesting Reads for Math Geeks

Many data scientists have a passion for mathematics, and many modern math problems can be explored using data science. Below is a selection of interesting articles, many about challenging, deep mathematical problems, by a data scientist who developed math-free algorithms. Some of these articles cover statistical theory and thus belong to data science; some are just about mathematics and number theory for their own sake. Most of them can be understood by the layman. Some include R code to produce visualizations, and some involve processing vast amounts of data -- trillions of data points -- thus providing an excellent sandbox to test distributed architecture implementations and high performance computing.

Math model: Tessellation

Thursday, June 22, 2017

My Best Data Science, Machine Learning and Related Articles

Here I list my most interesting contributions published on Data Science Central. My plan is to categorize and aggregate this content to produce a few self-published books. The material below will always be available for free (from this webpage), but the books won't be -- or if they are, they will be free for members only. So you might want to bookmark this page.
I have also written a number of academic papers; you can find some of them here.
My home office, where I write most of my DSC articles 
The articles below are listed in reverse chronological order. This is a work in progress: I am still adding older entries. So check back again in a few weeks!
1. Core Articles
2. Blog Posts About Data Science
3. Other Blog Posts
4. Guides and References
5. Repositories
Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

My Data Science Journey

Here I describe the projects that I worked on, as well as my career progress, starting 25 years ago as a PhD student in statistics and continuing until today, along with the slow transformation from statistician to data scientist that started more than 20 years ago. This also illustrates many applications of data science, most of which are still active.
Early years
My interest in mathematics started when I was 7 or 8; I remember being fascinated by the powers of 2 in primary school, and later purchasing cheap Russian math books (from Mir Publishers) translated into French, for my entertainment. In high school, I participated in the mathematical olympiads, and did my own math research during math classes rather than listening to the very boring lessons. When I attended college, I stopped showing up in the classroom altogether: after all, you could just read the syllabus, memorize the material before the exam, and regurgitate it at the exam. Moving fast forward, I ended up with a PhD summa cum laude in (computational) statistics, followed by a joint postdoc at Cambridge (UK) and the National Institute of Statistical Science (North Carolina). Just after completing my PhD, I had to do my military service, where I learned old database programming (SQL on DB2). This helped me get my first job in the corporate world in 1997 (in the US), where SQL was a requirement - and still is today for most data science positions.
My academia years (1988 - 1996)
My major was in Math/Stats at Namur University, and I was exposed between 1988 and 1997 to a number of interesting projects, most being precursors to data science:
  • Object oriented programming in Pascal and C++
  • Writing your own database software in Pascal (student project)
  • Simulation of pointers, calculator and recursion, in Fortran or Pascal
  • Creating my own image processing software (in C), reverse-engineering the Windows bitmap format and directly accessing memory with low-level code to load images 20 times faster than Windows
  • Working on image segmentation and image filtering (signal processing, noise removal, de-blurring), using mixture models / adaptive density estimators in high dimensions
  • Working with engineers on geographic information systems, and fractal image compression - a subject that I will discuss in my next book on automated data science. At the same time, working for a small R&D company, I designed models to predict extreme floods, using 3-D ground numerical models. The data was stored in a hierarchical database (digital images based on aerial pictures, the third dimension being elevation, and the ground being segmented into different categories - water, crop, urban, forest, etc.) Each pixel represented a few square feet.
  • Extensive research on simulation (to generate high quality random numbers, or to simulate various stochastic processes that model complex cluster structures)
  • Oil industry: detecting oil field boundaries by minimizing the number of dry wells - known as the inside/outside problem, and based on convex domain estimation
  • Insurance: segmentation and scoring of clients (discriminant analysis)
At Cambridge University in 1995 (click here to see the names of all these statisticians)
When I moved to Cambridge university stats lab and then NISS to complete my post-doc (under the supervision of Professor Richard Smith), I worked on:
  • Markov Chain Monte Carlo modeling (Bayesian hierarchical models applied to complex cluster structures)
  • Spatio-temporal models
  • Environmental statistics: storm modeling, extreme value theory, and assessing leaks at the Hanford nuclear reservation (Washington State), using spatio-temporal models applied to chromium levels measured in 12 wells. The question was: is there a trend - increased leaking - and is it leaking into the Columbia river located a few hundred feet away?
  • Disaggregation of rainfall time series (purpose: improve irrigation, the way water is distributed during the day - agriculture project) 
  • I also wrote an interesting article on simulating multivariate distributions with known marginals
Note: AnalyticBridge's logo represents the mathematical bridge in Cambridge.
My first years in the corporate world (1996 - 2002)
I was first offered a job at MapQuest, to refine a system that helps car drivers with automated navigation. At that time, the location of the vehicle was not determined by GPS, but by checking the speed and changes in direction (measured in degrees, as the driver makes a turn). This technique was prone to errors, and that's why they wanted to hire a statistician. But eventually, I decided to work for CNET instead, as they offered a full-time position rather than a consulting role.
I started in 1997 working for CNET, at that time a relatively small digital publisher (they eventually acquired ZDNet). My first project involved designing an alarm system to send automated emails to channel managers whenever traffic numbers were too low or too high: a red flag indicated significant under-performance, a bold red flag indicated extreme under-performance. Managers could then trace the dips and spikes to events taking place on the platform, such as a double load of traffic numbers (making the numbers twice as big as they should be), the web site being down for a couple of hours, a promotion, and so on. The alarm system used SAS to predict traffic (time series modeling with seasonality, and confidence intervals for daily estimates); Perl/CGI to develop it as an API, access databases, and send the automated emails; Sybase (star schema) to access the traffic database and create a small database of predicted/estimated traffic (to match with real, observed traffic); and of course cron jobs to run everything automatically, in batch mode, according to a pre-specified schedule - and to resume automatically in case of a crash or other failure (e.g. when production of traffic statistics was delayed or needed to be fixed first, due to some glitch). This might be the first time that I created automated data science.
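The flagging logic of that alarm system can be sketched in a few lines of Python (a simplified illustration, not the original SAS/Perl implementation; the 2-sigma and 3-sigma thresholds are my assumptions about what counted as "significant" and "extreme"):

```python
def traffic_alarm(observed, predicted, stdev):
    """Compare observed traffic to its predicted value and return a flag:
    'OK', 'RED' (significant deviation) or 'BOLD RED' (extreme deviation)."""
    deviation = abs(observed - predicted)
    if deviation > 3 * stdev:
        return "BOLD RED"   # extreme under- or over-performance
    if deviation > 2 * stdev:
        return "RED"        # significant under- or over-performance
    return "OK"

# Example: predicted 100,000 daily page views, daily standard error 5,000
print(traffic_alarm(98_000, 100_000, 5_000))   # OK
print(traffic_alarm(112_000, 100_000, 5_000))  # RED
print(traffic_alarm(201_000, 100_000, 5_000))  # BOLD RED (e.g. a double load)
```

In the production system, the predicted value and its confidence interval came from the time series model, and the flag was emailed out by a script run from cron.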
Later in 2000, I was involved with market research, business and competitive intelligence. My title was Sr. Statistician. Besides identifying, defining, and designing tracking (measurement) methods for KPI's, here are some of the interesting projects I worked on:
  • Ad inventory management, in collaboration with GE (they owned NBCi, the company I was working for; NBCi was a spin-off of CNET created in 1999). We worked on better predicting the number of impressions available for advertisers, to optimize sales and reduce both over-booking and unsold inventory. I also came up with the reach and frequency formula (a much cleaner description can be found in my book, page 231). Note that most people consider this to be a supply chain problem, which is a sub-domain of operations research. It is nevertheless very statistics-intensive and heavily based on data, especially when the inventory is digital and very well tracked.
  • Price elasticity study, to determine optimum prices based on prices offered by competitors, the number of competing products, and other metrics. The statistical model was not linear; it involved variables such as the minimum price offered by competitors, for each product and each day. I used a web crawler to extract the pricing information (displayed on the website) because the price database was terribly bad, with tons of missing prices and erroneous data.
  • Advertising mix optimization, using a long-term, macro-economic time series model with monthly data from various sources (ad spend for the various channels). I introduced a decay in the model, as TV ads seen six months earlier still had an impact today, although smaller than recent ads. The model included up to the last six months' worth of data.
  • Attribution modeling. How to detect, among a mix of 20 TV shows used each week to promote a website, which TV shows are most effective in driving sticky traffic to the website in question (NBCi). In short, the problem consists of attributing to each TV show its share of traffic growth, in order to optimize the mix, and thus the growth. It also includes looking at the lifetime value of a user (survival models) based on the acquisition channel. You can find more details on this in my book, page 248.
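The decay idea in the advertising mix model above can be illustrated with a geometric "adstock" transform: each month's effective ad pressure is the current spend plus a decayed memory of past spend. This is a generic sketch; the actual decay shape and rate used in the original model are not specified, so the 0.7 monthly rate and six-month cutoff here are assumptions:

```python
def adstock(monthly_spend, decay=0.7, max_lag=6):
    """Geometric decay transform: each month's effective ad pressure is the
    current spend plus a decayed memory of up to max_lag previous months."""
    out = []
    for t in range(len(monthly_spend)):
        total = sum((decay ** lag) * monthly_spend[t - lag]
                    for lag in range(min(t + 1, max_lag + 1)))
        out.append(round(total, 2))
    return out

# A single burst of TV spend keeps contributing, at a decreasing rate,
# over the following six months:
print(adstock([100, 0, 0, 0, 0, 0, 0]))
# [100.0, 70.0, 49.0, 34.3, 24.01, 16.81, 11.76]
```

The transformed series, rather than the raw spend, is then fed into the regression against sales or traffic.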
Consulting years (2002 - today)
I worked for various companies - Visa, Wells Fargo, InfoSpace, Looksmart, Microsoft, eBay - sometimes even as a regular employee, but mostly in a consulting capacity. It started with Visa in 2002, after a short stint with a statistical litigation company (William Wecker Associates), where I improved time-to-crime models that were biased because of right-censoring in the data (future crimes attached to a gun are not seen yet - this was an analysis in connection with the gun manufacturers lawsuit).
At Visa, I developed multivariate features for credit card fraud detection in real time, especially single-ping fraud, working on data sets with 50 million transactions - too big for SAS to handle at that time (a SAS sort would crash), and that's when I first developed Hadoop-like systems (nowadays, SAS sort can very easily handle 50 million rows without visible Map-Reduce technology). Most importantly, I used Perl, associative arrays, and hash tables to process hundreds of feature combinations (to detect the best one based on some lift metric), while SAS would - at that time - process one feature combination over a whole weekend. Hash tables were used to store millions of bins, so an important part of the project was data binning, and doing it right (too many bins results in a need for intensive Hadoop-like programming, too few results in a lack of accuracy or predictability). That's when I came up with the concepts of hidden decision trees, the predictive power of a feature, and testing a large number of feature combinations simultaneously. This is much better explained in my book, pages 225-228 and 153-158.
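The hash-table approach can be sketched with a Python dictionary keyed by a binned feature combination (an illustrative reconstruction with made-up transactions, not the original Perl code; here a bin's lift is its fraud rate divided by the overall fraud rate):

```python
from collections import defaultdict

# (amount_bin, country_match, hour_bin, is_fraud) -- made-up sample data
transactions = [
    ("high", 0, "night", 1),
    ("high", 0, "night", 1),
    ("high", 0, "night", 0),
    ("low",  1, "day",   0),
    ("low",  1, "day",   0),
    ("low",  0, "day",   0),
]

counts = defaultdict(lambda: [0, 0])  # feature combination -> [count, frauds]
for *features, is_fraud in transactions:
    counts[tuple(features)][0] += 1
    counts[tuple(features)][1] += is_fraud

n_total = sum(c for c, _ in counts.values())
base_rate = sum(f for _, f in counts.values()) / n_total

def lift(bin_key):
    """Fraud rate of the bin, relative to the overall fraud rate."""
    count, frauds = counts[bin_key]
    return (frauds / count) / base_rate

print(lift(("high", 0, "night")))  # this bin is twice as fraudy as average
```

One pass over the transactions fills millions of such bins; ranking bins (or feature combinations) by lift then identifies the most predictive combinations, which is what SAS was too slow to do at the time.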
After Visa, I worked at Wells Fargo, and my main contribution was to find that all our analyses were based on wrong data. It had been wrong for a long time without anyone noticing, well before I joined this project: Tealeaf sessions spanning across multiple servers were broken into smaller sessions (we discovered it by simulating our own sessions and looking at what showed up in the log files the next day), making it impossible to really track user activity. Of course we fixed the problem. The purpose here was to make user navigation easier, and to identify when a user is ready for cross-selling and which products should be presented to him/her based on history.
So I moved away from the Internet, to finance and fraud detection. But I came back to the Internet around 2005, this time to focus on traffic quality, click fraud, taxonomy creation, and optimizing bids on Google keywords - projects that require text mining and NLP (natural language processing) expertise. My most recent consulting years involved the following projects:
  • Microsoft: time series analysis (with correlograms) to detect and assess the intensity of change points or new trends in KPI's, and match them with events (such as redefining what counts as a user, which impacts the user count). Blind analysis, in the sense that I was told about the events AFTER I detected the change points.
  • Adaptive A/B testing, where sample sizes are updated every day to increase the sample of the best performer (and decrease the sample of the worst performer) as soon as a slight but statistically significant trend is discovered, usually halfway through the test
  • eBay: automated bidding for millions of keywords (old and new - most have no historical data) uploaded daily to Google AdWords. I also designed keyword scoring systems that predict performance in the absence of historical data by looking at metrics such as keyword length, number of tokens, presence of digits, special characters, or special tokens such as 'new', keyword rarity, keyword category, and related keywords.
  • Click fraud and botnet detection: creation of a new scoring system that uses IP flagging (by vendors such as Spamhaus and Barracuda) rather than unreliable conversion metrics, to predict quality and re-route Internet traffic. Ad matching (relevancy) algorithms (see my book, pages 242-244). Creation of the Internet topology mapping (or see my book, pages 143-147) to cluster IP addresses and better catch fraud, using advanced metrics based on these clusters.
  • Taxonomy creation: for instance, I identified that the restaurant category, in the Yellow Pages, breaks down not only by type of cuisine, but also by type of restaurant (city, view, riverfront, mountain, pub, wine bar), atmosphere (elegant, romantic, family, fast food), and stuff that you don't eat (menus, recipes, restaurant jobs and furniture). This analysis was based on millions of search queries from the Yellow Pages, and on crawling and parsing massive amounts of text (in particular the whole DMOZ directory, including crawling 1 million websites), to identify keyword correlations. Click here for details, or read my book, pages 53-57.
  • Creation of the list of the top 100,000 commercial keywords, and categorization by user intent. Extensive use of the Bing API and Google's paid keyword API to extract statistics for millions of keywords (value and volume depending on match type), as well as to detect new keywords. Click here for a related search intelligence problem.
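The zero-history keyword scoring mentioned in the eBay bullet relies on features computed from the keyword string itself. Here is a minimal sketch (the feature names and the exact feature set are my own illustration, not the production system):

```python
import re

def keyword_features(keyword):
    """Extract simple predictive features from a keyword string, usable
    even when the keyword has no performance history on AdWords."""
    kw = keyword.lower()
    tokens = kw.split()
    return {
        "length": len(kw),                                       # character count
        "n_tokens": len(tokens),                                 # word count
        "has_digit": int(any(ch.isdigit() for ch in kw)),        # contains a digit
        "has_special": int(bool(re.search(r"[^a-z0-9 ]", kw))),  # punctuation etc.
        "has_new": int("new" in tokens),                         # special token 'new'
    }

print(keyword_features("new iphone 8 case"))
# {'length': 17, 'n_tokens': 4, 'has_digit': 1, 'has_special': 0, 'has_new': 1}
```

A score would then be a weighted combination of such features, with the weights calibrated on the minority of keywords that do have history.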
During these years, I also created my first start-up to score Internet traffic (raising $6 million in funding) and produced a few patents.
As the co-founder of DataScienceCentral, I am also the data scientist on board, optimizing our email blasts and traffic growth with a mix of paid and organic traffic as well as various data science hacks. I also optimize client campaigns and manage a system of automated feeds for automated content production (see my book, page 234). But the most visible part of my activity consists of the articles that I write and publish on DSC.
I am also involved in designing API's and AaaS (Analytics as a Service). I actually wrote my first API in 2002, to sell stock trading signals: read my book, pages 195-208, for details. I was even offered a job at Edison (a utility company in Los Angeles) to trade electricity on their behalf. And I also worked on other arbitraging systems, in particular click arbitraging.
We grew revenue and profits from 5 to 7 digits in less than two years, while maintaining profit margins above 65%, and grew traffic and membership by 300% in two years. We introduced highly innovative, efficient, and scalable advertising products for our clients. DataScienceCentral is an entirely self-funded, lean startup with no debt and no payroll (our biggest expense, relative to gross revenue, is taxes). I used state-of-the-art growth science techniques to outperform the competition.
Publications, Conferences
Selected Data Science Articles: Click here to access the list. 
Refereed Publications
You can follow me on ResearchGate to check out my research activities.
Other Selected Publications
  • Granville V., Smith R.L. Clustering and Neyman-Scott process parameter simulation via Gibbs sampling. Tech. Rep. 95-19, Statistical Laboratory, University of Cambridge (1995, 20 pages).
  • Granville V. Sampling from a bivariate distribution with known marginals. Tech. Rep. 47, National Institute of Statistical Science, Research Triangle Park, North Carolina (1996).
  • Granville V., Smith R.L. Disaggregation of rainfall time series via Gibbs sampling. Preprint, Statistics Department, University of North Carolina at Chapel Hill (1996, 20 pages).
  • Granville V., Rasson J.P., Orban-Ferauge F. From a natural to a behavioral classification rule. In: Earth Observation, Belgian Scientific Space Research (Ed.), 1994, pp. 127-145.
  • Granville V. Bayesian filtering and supervised classification in image remote sensing. Ph.D. Thesis, 1993.
Conferences and Seminars
  • COMPSTAT 90, Dubrovnik, Yugoslavia, 1990. Communication: A new modeling of noise in image remote sensing. Funded by the Communaute Francaise de Belgique and FNRS.
  • 3rd Journees Scientifiques du Reseau de Teledetection de l'UREF, Toulouse, France, 1990. Poster: A non-Gaussian approach to noise in image processing (in French). Funded.
  • 4th Journees Scientifiques du Reseau de Teledetection de l'UREF, Montreal, Canada, 1991. Poster: A Bayesian model for image segmentation (in French). Funded by AUPELF-UREF.
  • 12th Franco-Belgian Meeting of Statisticians, Louvain, Belgium, 1991. Invited communication: Markov random field models in image remote sensing.
  • 24th Journees de Statistiques, Bruxelles, Belgium, 1992. Software presentation: Rainbow, a dedicated graphics program (in French).
  • 8th International Workshop on Statistical Modeling, Leuven, Belgium, 1993. Poster: Discriminant analysis and density estimation on the finite d-dimensional grid.
  • 14th Franco-Belgian Meeting of Statisticians, Namur, Belgium, 1993. Invited communication: Intensity measures and geometric tools in discriminant analysis (in French).
  • Centrum voor Wiskunde en Informatica (CWI), Amsterdam, Netherlands, 1993 (one week stay). Invited seminar: Bayesian filtering and supervised classification in image remote sensing. Invited by Professor Adrian Baddeley and funded by CWI.
  • Annual Meeting of the Belgian Statistical Society, Spa, Belgium, 1994. Invited communication: Maximum penalized likelihood density estimation.
  • Statistical Laboratory, University of Cambridge, 1994. Invited seminar: Discriminant analysis and filtering: applications in satellite imagery. Invited by Professor R.L. Smith. Funded by the University of Cambridge.
  • Invited seminars on clustering, Bayesian statistics, spatial processes, and MCMC simulation:
    • Scientific Meeting FNRS, Brussels, 1995. Invited communication: Markov Chain Monte Carlo methods in clustering. Funded.
    • Biomathematics and Statistics Scotland (BioSS), Aberdeen, 1995. Invited by Rob Kempton and funded by BioSS.
    • Department of Mathematics, Imperial College, London, 1995. Invited by Professor A. Walden.
    • Statistical Laboratory, Iowa State University, Ames, Iowa, 1996. Invited by Professor Noel Cressie and funded by Iowa State University.
    • Department of Statistics and Actuarial Sciences, University of Iowa, Iowa City, Iowa, 1996. Invited by Professor Dale Zimmerman.
    • National Institute of Statistical Sciences (NISS), Research Triangle Park, North Carolina, 1996. Invited by Professor J. Sacks and Professor A.F. Karr. Funded by NISS.
  • 3rd European Seminar of Statistics, Toulouse, France, 1996. Invited. Funded by the EC.
  • 1st European Conference on Highly Structured Stochastic Systems, Rebild, Denmark, 1996. Contributed paper: Clustering and Neyman-Scott process parameter estimation via Gibbs sampling. Funded by the EC.
  • National Institute of Statistical Sciences (NISS), Research Triangle Park, North Carolina, 1996. Seminar: Statistical analysis of chemical deposits at the Hanford site.
  • Institute of Statistics and Decision Sciences, Duke University, Durham, North Carolina, 1997. Invited seminar: Stochastic models for hourly rainfall time series: fitting and statistical analysis based on Markov Chain Monte Carlo methods. Invited by Professor Mike West.
  • Southern California Edison, Los Angeles, 2003. Seminar: Efficient risk management for hedge fund managers. Invited and funded.
  • InfoSpace, Seattle, 2004. Seminar: Click validation, click fraud detection and click scoring for pay-per-click search engines. Invited and Funded.
  • AdTech, San Francisco, 2006. Click fraud panel.
  • AMSTAT JSM, Seattle, 2006. Talk: How to address click fraud in Pay-per-Click programs.
  • Predictive Analytics World, San Francisco, 2008. Talk: Predictive keyword scores to optimize pay-per-click campaigns. Invited and Funded.
  • M2009 – SAS Data Mining Conference, Las Vegas, 2009. Talk: Hidden decision trees to design predictive scores – application to .... Invited and Funded. 
  • Text Analytics Summit, San Jose, 2011. Detection of Spam, Unwelcomed Postings, and Commercial Abuses in So.... Invited. 

Fuzzy Regression: A Generic, Model-free, Math-free Machine Learning Technique

  A different way to do regression with prediction intervals. In Python and without math. No calculus, no matrix algebra, no statistical eng...