Data Mining

What is data mining?

Data mining is a complex process, often considered a form of artificial intelligence, which generally outputs its results in the form of patterns or predictions. There are many “families” of data mining techniques, depending on the type of analysis algorithm used.

Here are a few definitions.

Oracle defines data mining as follows:

The practice of automatically searching large stores of data to discover patterns and trends that go beyond simple analysis. Data mining uses sophisticated mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also known as Knowledge Discovery in Data (KDD).

 The key properties of data mining are:

  • Automatic discovery of patterns
  • Prediction of likely outcomes
  • Creation of actionable information
  • Focus on large datasets and databases

Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

 

Microsoft SQL Server 2016 includes “Analysis Services”, which categorizes its algorithms as follows:

  • Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
  • Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
  • Segmentation algorithms divide data into groups, or clusters, of items that have similar properties (a short sketch of this idea in R follows this list).
  • Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
  • Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow.

The most fascinating aspect of data mining is the fact that it may return unexpected or unpredictable results.
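To make the segmentation category a little more concrete, here is a minimal sketch in R (the language used later in these notes) that clusters a small dataset with k-means. The built-in iris dataset and the choice of three clusters are my own illustrative assumptions, not something prescribed by Analysis Services or any of the other products mentioned here.

# a minimal sketch of a segmentation (clustering) algorithm in R -- the iris
# dataset and the choice of 3 clusters are illustrative assumptions
data(iris)
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

set.seed(42)                              # make the clustering reproducible
segments <- kmeans(features, centers = 3) # group the rows into 3 clusters

table(segments$cluster, iris$Species)     # compare the clusters with the known species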

 

 

Examples

The Easy.Data.Mining site contains an interesting list of data mining experiments. Here is a summary of the three I found most interesting:

  1. Beer & nappies: A supermarket put all its assumptions aside and reassessed its sales strategy with respect to the positioning of goods in the store. Even the retail chain’s usual product categories were ignored, i.e. foods were not just compared to foods, but to everything else. The supermarket also added other data to the analysis – e.g. the gender of the buyers, weekdays, and more. Men who have children and who (have to) do the shopping on Saturdays often tend to buy nappies for their little ones plus beer for the weekend evenings in front of the television. The supermarket therefore decided to position the pallets of beer beside those of nappies on Saturdays – and sales figures rose sharply.
  2. A car insurance company wants to predict the probability of a car accident happening within a certain period of time on the basis of the customer data available at the time of signing the insurance policy (e.g. personal data, attributes of the car to be insured, history of accidents). A data table is available in which each record represents a past customer at the beginning of a year, together with the customer’s claim class in that year. A prediction model is created from this table, and it reveals interesting customer segments with a high risk of belonging to a bad claim class (a minimal modelling sketch follows this list).
  3. In a medical test phase, a new treatment is given to test patients. Personal attributes (e.g. weight, gender, medical history) are obtained and stored for each test patient. At the end of the test phase the patients are split into different classes depending on whether they reacted positively, neutrally or negatively to the treatment. Pattern recognition may reveal the combinations of attributes responsible for a patient reacting positively or negatively to the treatment.
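As a hedged illustration of the insurance example above, the following R sketch fits a simple classification model (a logistic regression) to made-up policy data. The column names and the simulated risk formula are assumptions for demonstration only, not real insurance data or any particular vendor’s method.

# made-up policy data -- columns and risk formula are purely illustrative
set.seed(1)
n <- 500
policies <- data.frame(
  driver_age   = sample(18:75, n, replace = TRUE),
  car_power_kw = sample(50:250, n, replace = TRUE),
  past_claims  = rpois(n, 0.3)
)
risk <- plogis(-4 + 0.03 * (75 - policies$driver_age) +
               0.01 * policies$car_power_kw + 0.8 * policies$past_claims)
policies$bad_claim <- rbinom(n, 1, risk)           # simulated "bad claim class" flag

model <- glm(bad_claim ~ driver_age + car_power_kw + past_claims,
             data = policies, family = binomial)   # classification via logistic regression
summary(model)                                     # which attributes drive the risk?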

 

Why not try it out?

While producing a data mining solution from scratch is extremely costly and requires years of research, there are many software solutions out there, often included in CRM, business intelligence or database software suites.

Here is a list of Software Products containing Data Mining functionality:

  • SAS (Enterprise Miner)
  • IBM (SPSS Modeler)
  • Microsoft SQL (Analysis Services)
  • Oracle Data Mining

There are also many free solutions out there, such as Apache Mahout.
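R also offers free data mining packages. As a hedged example (my choice of package, not something any of the products above require), the arules package can run the kind of market basket analysis mentioned earlier, using the Groceries dataset that ships with it:

library(arules)                         # free association-rule mining package
data("Groceries")                       # example market-basket dataset bundled with arules

rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.3))  # minimum support and confidence
inspect(head(sort(rules, by = "lift"), 5))                   # the 5 rules with the highest lift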

So why don’t you check whether the software licensed for your company includes data mining features? Spend some time running algorithms against your company’s data and training (fine-tuning) the data mining engine – who knows, you may find a rare gem or nugget that revolutionizes the way your company does business!

 


Big Data Versus Traditional Data Processing

What is Big Data?

Big Data has become a buzzword in the last few years. Many companies understand that capturing and analysing large volumes of data can help them gain a competitive advantage.

Gartner defines Big Data as: “high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”

In reality, not so many companies process Big Data (which would imply facing high volume, high velocity or high variety information assets), but many companies do process large volumes of data through modern technologies such as Software as a Service (SaaS) solutions and cloud-based data storage.

 

Case Study – 2 Irish Examples

Here are two Irish examples of companies whose business revolves around serious data processing.

Fleetmatics

Fleetmatics is a successful software company based in Tallaght, producing fleet management software. As of June 30, 2015, they served approximately 29,000 fleet management customers, with approximately 625,000 subscribed vehicles worldwide.

A key part of their software suite is automated geofencing, which involves analyzing huge amounts of GPS data coming from vehicles fitted with GPS tracking devices. Automated geofencing helps optimize drivers’ daily routes, reducing travel times and thereby increasing productivity and cutting travel costs.

[Image: Fleetmatics]

 

 

Paddy Power

Paddy Power needs no introduction. It is one of the world’s largest betting and gaming groups and employs over 1,000 people in Ireland, including several hundred at their “Power Tower” R&D headquarters in Clonskeagh.

The company thrives on big volumes of data. Their Analytics and Marketing departments are constantly hiring. Here is an excerpt from their hiring site:

“Just because we like to have a bit of fun in what we do, don’t be fooled by the fluffy exterior. Beneath the surface Paddy Power is a numbers-driven company. We work in an industry underpinned by risk and probability and use a wide range of numerical methods every day to plan our marketing spend, assess our performance & beat the competition.”

“Our Customer Intelligence Team (CIT) plays a key role in providing analytical expertise across the Online Division. Within this centralised team, we analyse web traffic, promotional spend, betting patterns, online marketing and customer sentiment.”

[Image: Paddy Power Customer Intelligence Team]

 

Big Data or Traditional Data Processing?

I visited Fleetmatics and Paddy Power in the last 12 months, and both companies did state that, yes, they process “Big Data”.

I went to the companies’ careers sites to check the skills required for development roles. Were they looking for developers to work on “Big Data” or on “big data”?

Typical skills required at Fleetmatics were:

  • Coding: .NET Platform, C#, ASP/ASP.NET, JavaScript, jQuery and JSON
  • Database: Microsoft SQL, knowledge of MySQL

I only found a few mild references to Big Data for roles at their Florence, Italy location (where they have a small research team focused on GPS data research). Their open Research Engineer position’s job spec stated:

Essential Qualifications:

  • 2+ years’ scientific research experience in the field of Computer Science, Data Analytics or Operations Research;
  • Hands-on experience with predictive modeling, machine learning, data mining, and parallel computing.

Desired Qualifications:

  • PhD in Computer Science, Data Analytics or Operations Research.
  • Advanced knowledge of algorithms and data structures for large scale optimization problems; parallel and distributed programming.
  • Experience with R (r-project.org) and Big Data Analytics tools (Hadoop, Spark or similar).

 

As for Paddy Power, the typical skills required were:

  • Coding: .NET, C#, JavaScript, jQuery
  • Database: Microsoft SQL, NoSQL

Again it was difficult to find requirements around Big Data. There were no references to Hadoop or Big Data in any of the 150 jobs advertised.

 

Where Are The Job Opportunities?

It is quite apparent that real Big Data processing – the type that requires Hadoop Distributed File System (HDFS) infrastructure and MapReduce technology – is not widespread.
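For readers who have not met these terms, here is a toy, in-memory illustration of the MapReduce word-count idea written in plain R. It only shows the “map” and “reduce” steps conceptually and says nothing about how Hadoop actually distributes the work across a cluster.

lines <- c("big data is big", "data beats opinion")   # toy input

# "map" step: break every line into words (each word is effectively a (word, 1) pair)
mapped <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))

# "shuffle + reduce" step: group by word and sum the counts
word_counts <- table(mapped)
word_counts                                           # big=2, data=2, the rest 1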

Out of 2,832 nationwide IT jobs advertised on irishjobs.ie on September 19, only 70 (2.4%) referred to Hadoop, compared with 10.5% referencing C#, 19.4% referencing Java and 28.8% referencing SQL.

My reading of this is that most companies remain “shy” about moving to Big Data analysis and that now is the ideal time to gain working experience in the field (possibly as a side activity if working, or as a “hobby” if not working).

While most companies are not yet ready to think big, there is a lot of interest building up in the field – for example, meetup.com has a community of over 900 Big Data developers and another of over 350 Hadoop enthusiasts:

[Image: meetup.com Big Data and Hadoop group listings]

Traditional database administrator and database developer roles will not go away…

However, we will see a big rise in modern data analysis, focusing more on data visualization and the use of complex business intelligence suites such as SAS (currently sought in 1.8% of IT job offers), Tableau (currently sought in 1.3% of IT job offers), etc. I wouldn’t be surprised if that demand doubled in the next 12 months.

Both Tableau and SAS offer high-cost certifications and training – a sure sign that these are software packages with a future. Here is a sample Udemy offer for a 6-hour Tableau online course:

[Image: Udemy offer for a 6-hour Tableau online course]

Predictions?

Big Data processing will not apply to all companies, but more and more of them (from an independent café to a small company selling niche products to any medium or large organization) will realize they can benefit from using data analysis software to maximize their profits. There will most likely be great opportunities for Hadoop programmers for many years to come, but there will also be plenty of opportunities for data analysts who specialize in expert tools.

If “Big Data” (with a big B and a big D) does not apply, big data most certainly will, as all companies are storing more and more of their transaction data.

 


Everything Starts with Data – The Importance of Good Data Management

 

Importance of data

Information Systems (IS) are “a set of interrelated components that collect, manipulate & disseminate data & information & provide feedback to meet an objective”.

Gathering Business Intelligence (BI) starts with obtaining data from the business environment.

We often hear the expression “data is the lifeblood of business”.

For example, PwC (180,000 employees):

[Image: PwC statement on the importance of data]

Another example: Microsoft (100,000 employees), through their TechNet Magazine:

[Image: Microsoft TechNet Magazine statement on data]

There is a lot of talk about big data and unstructured data, but it is also important for companies to remember the importance of structured data and data management at enterprise level.

Master Data Management (MDM) is a prerequisite to producing IS and BI at company level.

 

Having worked in several multinationals (specifically in the area of localisation) I know how difficult it can be to obtain accurate data.

Taking a simple example: I recently joined a well-known security software company as a localisation engineer. My first challenge was to understand the products I would manage and the people I would work with. In my first few days I read through many documents (ppts, docs, spreadsheets) stored on file control servers and team collaboration servers, only to find endless lists of project and project-assignment information.

Our team ended up creating a new, up-to-date list of assignments for our function – and one week later we added information about other stakeholders not centralized elsewhere (content developers) to that list.

 

Centralizing data is the most efficient way of maintaining it and keeping it up to date. Unfortunately, using a bottom-up approach to centralize information or data can result in resistance and a feeling of conflict of interest – why are you asking us to create a centralized list, or worse still, why are you asking us to update the list you own and manage?

 

Master Data Management Frameworks

In most organizations, a top-down approach is required to implement efficient data management, and more particularly MDM. SAS, who have been at the forefront of data management and BI for many years, acknowledge both the complexity and the returns associated with implementing MDM:

With master data management, the payoff is big – but the process can be complicated. SAS knows that when it comes to implementing MDM, you need to get a solid start and build out from there. That’s why our experts work to understand your business and goals. It’s why we enable a phased approach to deliver business value. And it’s why we can get you up and running as quickly as possible.

Information Management (http://information-management.com/) is a good source of data management articles written by practitioners. I set about understanding how companies have successfully implemented MDM throughout their organizations.

The best methodology I found was Gartner’s “Seven Building Blocks of Master Data Management” Framework:

[Figure: Gartner’s Seven Building Blocks of Master Data Management framework]

It presents a fully top-down approach where vision and strategy are defined and sponsored by senior leadership. The leadership team invests in MDM governance and an MDM organization to ensure success. From that point, the processes to extract data, the metrics to report on it and the actual infrastructure can be agreed and implemented.

 

When it comes to setting up the MDM infrastructure, there is a wealth of information available on the internet. Microsoft’s “The What, Why, and How of Master Data Management” article is very detailed and proposes 11 phases for successfully implementing MDM:

  1. Identify sources of master data.
  2. Identify the producers and consumers of the master data.
  3. Collect and analyze metadata about your master data.
  4. Appoint data stewards.
  5. Implement a data-governance program and data-governance council.
  6. Develop the master-data model.
  7. Choose a toolset.
  8. Design the infrastructure.
  9. Generate and test the master data.
  10. Modify the producing and consuming systems.
  11. Implement the maintenance processes.
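As a small, hedged illustration of phases 1 and 9 above, the sketch below merges customer records from two hypothetical source systems in R and flags conflicts that a data steward would need to resolve. The system names, column names and records are all invented for the example.

# two hypothetical source systems holding customer master data
crm   <- data.frame(email = c("ann@example.ie", "bob@example.ie"),
                    name  = c("Ann Byrne", "Bob Kelly"))
sales <- data.frame(email = c("bob@example.ie", "cora@example.ie"),
                    name  = c("Robert Kelly", "Cora Lynch"))

# build a candidate master table, one row per customer key (the email address)
master <- merge(crm, sales, by = "email", all = TRUE, suffixes = c("_crm", "_sales"))

# flag records where the two systems disagree -- these go to a data steward
master$conflict <- !is.na(master$name_crm) & !is.na(master$name_sales) &
                   master$name_crm != master$name_sales
master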

 

Such best practices highlight that data collection and data management should not be underestimated or considered trivial.

Final thoughts

Without careful governance and centralized management, data ends up being disseminated, becomes difficult to manage and often becomes stale or unreliable.

Be aware of this if asked to provide metrics or KPIs to a management team. Make sure to at least work with others to best understand overlaps and avoid redundancy!

My main recommendation is to take a bird’s-eye view when it comes to data gathering: consider it a project where planning and team collaboration are key to success, and see it as a long-term effort rather than a one-off scramble to provide data to your management team.

 


My First R program

Having completed an R tutorial (http://tryr.codeschool.com/) as advised by my lecturer Darren Redmond of Dublin Business School, I quickly realized that my first R program could return something a lot more interesting than “Hello World”.

So I decided to find out the average age at which a cyclist “peaks”. Here are the steps I followed to get the initial data:

1. I got the names, dates of birth and other details related to the 100 best cyclists of 2014 from the official site CyclingRanking.com (http://www.cyclingranking.com/Rankings/LastYear.aspx).

2. I copy/pasted the cyclist data into Excel, made a few simple changes and calculated the age of each cyclist as (01/07/2014 – <date of birth>) / 365 (an R alternative is sketched after step 3):

[Image: cyclist data in Excel, with the calculated Age column]

3. Saved Name and Age data into a tab-separated CSV:

[Image: tab-separated file with Name and Age columns]
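As an aside, the age calculation from step 2 could also have been done directly in R rather than Excel. Here is a minimal sketch with an illustrative date of birth (not one of the real cyclists’ dates):

dob <- as.Date("1986-03-14")                          # illustrative date of birth
age <- as.numeric(as.Date("2014-07-01") - dob) / 365  # days between the dates, divided by 365
age                                                   # roughly 28.3 years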

I then wanted to represent the cyclist ages as a curve, which I imagined would be a close-to-perfect bell curve with cyclist ages on the X axis, showing me the mean value (representing the “optimal”/“peak” age for a cyclist) – possibly 30 or 32 years of age.

I went about writing the R code for this, which took me the best part of 10 minutes, including looking up how to import CSV data and plot the data.

Here is the code I produced:

dat <- read.csv(file="C:/Users/Jean/Documents/DBS/DataAnalytics/R/cyclistage.csv", header=TRUE, sep="\t") # read the 100 cyclist names and ages from the tab-separated file

library(ggplot2)

ggplot(dat, aes(x=Age)) + geom_density() + geom_vline(aes(xintercept=mean(Age, na.rm=T)), color="red", linetype="dashed", size=1) # density curve and dashed line at the mean age

And here is the result:

[Figure: density plot of the 100 cyclists’ ages, with a dashed red line at the mean (~28.5 years)]

Three things impressed me when I saw these results:

  1. Cyclists peak at ~28.5 years and not 30 to 32 as I expected
  2. Bell curves are sometimes not as perfect as what you generally see in books representing average weights / heights etc.
  3. But most of all, my first R program took a big set of data and turned it into a meaningful graph with only 3 lines of code – one of which simply loaded the library storing the graph function.

While I only spent very limited time on this assignment, it made me realize all the possibilities of aggregating and representing large amounts of data.

For example:

– See how the average peak age of cyclists has evolved over the years (I looked at 2014 data, but figures are available going back to 1869). Imagine watching the average peak age move up or down in the form of a video as time goes by (a rough sketch follows these ideas).

– Show individual cyclists’ rankings evolving over the years, in the form of curves or moving scatterplots.
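Here is a hedged sketch of the first idea, assuming I had scraped a per-year table with Year and Age columns; the figures below are placeholders, not real CyclingRanking.com data.

library(ggplot2)

peak_age <- data.frame(Year = 1990:1994,                       # placeholder years
                       Age  = c(29.1, 28.8, 28.6, 28.9, 28.5)) # placeholder average ages

ggplot(peak_age, aes(x = Year, y = Age)) +
  geom_line() +
  geom_point() +
  labs(y = "Average age of the top 100 cyclists")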

While not as rich and amazing as the data which Professor Hans Rosling presented in his 2006 “The best stats you’ve ever seen” TED Talk (last watched on August 22, 2015 at http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen) [zap directly to minute 4 if in a hurry], I am sure there would be ways of visualizing cyclists’ performance evolution in a very compelling way.

 

Use of Google’s Fusion Tables App

As part of my first Data Analytics assignment, I used Google’s Fusion Tables app to map the Irish population across counties.
I used Dr. Mu Lin’s tutorial to get familiar with the creation of fusion tables, while the links to the Irish 2011 census population data and the Irish KMZ (Keyhole Markup Language, Zipped) file were provided by my lecturer.

Creating a fusion table is very easy and basically consists of merging map data with population data, the common denominator being the county. The result is a map which can be customized to best represent the data.

I chose a blue colour-scale (6 levels) to bucket the counties and visually highlight most/least populated counties.

Here is the output:

This showed me how quick and easy it can be to render data on a map – around 15 minutes.

I then thought of how this population data could be correlated to other data.

I was interested in finding out whether second-hand car sales at county level were somehow linked to population.

I went to carzone.ie and pulled down the number of used cars for sale per county:

[Image: number of used cars for sale per county, from carzone.ie]

Based on this data, I calculated the number of second-hand cars for sale per 1,000 people and created a fusion table against it. Here is the result:

Interestingly, there does appear to be some correlation: the most populated counties (Dublin, Galway) tend to have more second-hand cars for sale than less populated counties (Leitrim, Monaghan, Cavan).

By making a few assumptions, we can come up with some interesting theories (further drill-down and research would be required to come to any definite conclusions) – for example:

We would expect people in rural (least populated) counties to own more cars per capita, because public transport is less developed and distances to amenities are generally greater. However, people in more populated counties sell more second-hand cars per capita. This is not a contradiction, but rather an indication that people in more populated counties are wealthier and change cars more often. This is quite visible in areas of South Dublin, where 13x, 14x and 15x reg cars are seen more frequently than older models.

Of course there are exceptions which would be interesting to understand – for example, why are there so many second-hand cars for sale in Westmeath, which is a low-density county? Is it to do with the rich farming soil of the county? Are there any high-profile companies based in Mullingar or Athlone which could explain the discrepancy?

Another factor to consider is that we are viewing county population as opposed to population density – in other words, we are not taking the size of the county into account. While both Dublin and Cork are high-population counties, Dublin is 20 times more densely populated than Cork!
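As a hedged aside, the cars-per-1,000 and density calculations are easy to reproduce in R. The county table below uses placeholder figures rather than the real census or carzone.ie numbers.

# illustrative county table -- the figures are placeholders, not real data
counties <- data.frame(
  county     = c("Dublin", "Cork", "Leitrim"),
  population = c(1270000, 520000, 32000),
  area_km2   = c(921, 7500, 1590),
  used_cars  = c(21000, 7800, 310)
)

counties$cars_per_1000 <- 1000 * counties$used_cars / counties$population  # the ratio mapped above
counties$density_km2   <- counties$population / counties$area_km2          # people per square km
counties[order(-counties$cars_per_1000), ]                                 # rank counties by the ratio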

I found this assignment extremely interesting. Firstly it showed me how easy it is to publish visually appealing maps. More importantly it struck me how visual context can trigger the investigative thinking required for data analysis.