4 Data Analysis
Data analysis is all about taking raw data and turning it into useful information that can serve specific goals. In this section, we’ll explore the different types of data analysis, walk through the steps involved, and look at some common challenges that come up in the process.
4.1 Types of Analyses
4.1.1 Descriptive Analysis
Raw data can be difficult to work with (especially when there’s a lot of it), making it hard to manage or draw clear insights. Descriptive analysis helps by simplifying the data, making it easier to understand and interpret. This could involve summarizing the data or presenting it in a way that highlights patterns or trends. The goal of descriptive analysis isn’t to draw specific conclusions but to give a clear picture of the data as it is.
In descriptive analysis, two important concepts are central tendency and dispersion. Central tendency refers to measures like the mean, median, and mode, which help identify the “center” of the data. Dispersion, on the other hand, shows how spread out the data is, using measures such as the standard deviation and ranges like the interquartile range.
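As a small illustration, the salary figures below are invented and Python’s built-in statistics module is used purely as an example of how these measures might be produced:

```python
import statistics

# Hypothetical monthly salaries for a small team (illustrative values only)
salaries = [3200, 3400, 3400, 3600, 3900, 4100, 4800, 5200, 6000, 9500]

mean = statistics.mean(salaries)      # central tendency: arithmetic average
median = statistics.median(salaries)  # central tendency: middle value
mode = statistics.mode(salaries)      # central tendency: most frequent value
sd = statistics.stdev(salaries)       # dispersion: sample standard deviation

# Dispersion: interquartile range from the quartile cut points (Q1, Q2, Q3)
q1, _, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

print(f"mean={mean:.0f}, median={median:.0f}, mode={mode}, sd={sd:.0f}, IQR={iqr:.0f}")
```

Note how the single large salary pulls the mean above the median; surfacing that kind of pattern is exactly what descriptive analysis is for.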
4.1.2 Inferential Analysis
It is often impractical to collect data from an entire population, especially if it’s very large. For instance, surveying every employee in a big company for a pulse survey would be costly and time-consuming. Instead, a more practical approach is to collect data from a sample, which represents the broader population. Analyzing this sample is called inferential analysis.
This analysis involves estimating key parameters, like the ones mentioned above, and testing hypotheses. If the sample is large enough and chosen randomly, it can provide a good reflection of the population’s statistics, such as distribution, mean, and standard deviation. However, it’s important that the assumptions made about the population are reasonably accurate for the inferential analysis to work well.
It’s also crucial to ensure the sample truly represents the entire population. For example, conducting a company-wide survey in only 3 departments during a national holiday may miss the views of employees based in other departments or countries, leading to skewed results.
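As a small sketch, the engagement scores below are simulated stand-ins for real survey responses; a sample mean and an approximate 95% confidence interval for the population mean could be computed like this:

```python
import math
import random
import statistics

# Hypothetical engagement scores (0-10 scale) from a random sample of 200 employees;
# random.gauss simply stands in for real survey responses here
random.seed(42)
sample = [random.gauss(7.2, 1.1) for _ in range(200)]

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Approximate 95% confidence interval for the population mean, using the normal
# approximation (reasonable because the sample is large and randomly drawn)
margin = 1.96 * sample_sd / math.sqrt(n)
print(f"estimated mean engagement: {sample_mean:.2f} "
      f"(95% CI {sample_mean - margin:.2f} to {sample_mean + margin:.2f})")
```

The interval quantifies how far the sample mean could plausibly sit from the true population mean, which is the essence of inferential analysis.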
4.1.3 Predictive Analysis
Predictive analysis builds on the ideas of inferential analysis, allowing us to use past data to predict future outcomes. This is done by analysing an existing dataset with known attributes, called the training set, to identify potential predictive patterns. These patterns are then tested on a different dataset, known as the test set, to see how strong or reliable the relationships are.
A common example of predictive analysis is regression analysis. The simplest form is linear regression, where a linear relationship is assumed between a dependent variable and an independent variable. For example, linear regression could be used to predict a median salary given the seniority of an employee using past data to determine the slope between different job grades.
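A minimal sketch of this salary example, with invented grades and salaries and using Python’s built-in least-squares fit, might look like this:

```python
import statistics

# Hypothetical training set: job grade (independent) vs median salary (dependent)
grades = [1, 2, 3, 4, 5, 6]
median_salaries = [32000, 38000, 45000, 54000, 66000, 80000]

# Ordinary least squares fit: salary ≈ intercept + slope * grade (Python 3.10+)
slope, intercept = statistics.linear_regression(grades, median_salaries)

# Predict the median salary for a grade outside the training data, e.g. grade 7
predicted = intercept + slope * 7
print(f"slope={slope:.0f}, intercept={intercept:.0f}, "
      f"predicted grade-7 salary={predicted:.0f}")
```

In practice the fitted line would then be evaluated against a separate test set, as described above, before being relied on for prediction.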
4.1.4 Prescriptive Analysis
Prescriptive analysis goes a step further than predictive analysis by not only forecasting future outcomes but also recommending actions to achieve desired results. It combines historical data, predictive models, and business rules to suggest the best course of action in various scenarios.
This analysis involves using algorithms and optimisation techniques to evaluate different options and their potential outcomes. For example, a company may use prescriptive analysis to reduce their gender pay gap by inputting the salary budget, compensation philosophy factors (performance, tenure, job grade, skills, etc.) and a target or maximum allowed pay gap.
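As a deliberately simplified sketch of this pay-gap example, where the salaries, budget, uplift rule and gap definition are all invented for illustration, the prescriptive logic can be expressed as a search over candidate actions for the cheapest one that meets the target within the budget:

```python
# Hypothetical current salaries by group (illustrative values only)
group_a = [52000, 58000, 61000, 67000]   # higher-paid group
group_b = [47000, 50000, 55000, 60000]   # lower-paid group

budget = 15000      # assumed salary budget available for adjustments
target_gap = 0.02   # assumed target: mean pay gap of at most 2%

def mean(values):
    return sum(values) / len(values)

def pay_gap(a, b):
    """Relative difference in mean pay between the two groups."""
    return (mean(a) - mean(b)) / mean(a)

# Candidate actions: give group B a uniform percentage uplift of 0.0% to 10.0%
best = None
for step in range(0, 101):
    uplift = step / 1000                          # 0.1% increments
    adjusted = [s * (1 + uplift) for s in group_b]
    cost = sum(adjusted) - sum(group_b)
    if cost > budget:
        break                                     # over budget; larger uplifts cost more
    gap = pay_gap(group_a, adjusted)
    best = (uplift, cost, gap)
    if gap <= target_gap:
        break                                     # target met at the lowest sufficient cost

uplift, cost, gap = best
print(f"recommended uplift: {uplift:.1%}, cost {cost:.0f}, resulting pay gap {gap:.1%}")
```

With these made-up numbers the budget runs out before the 2% target is reached, so the sketch recommends the largest affordable uplift. A real prescriptive tool would use proper optimisation and richer business rules, but the structure of candidate actions, constraints and an objective is the same.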
By simulating various scenarios, prescriptive analysis helps decision-makers understand the potential impact of their choices and identify the most effective strategies to reach their goals. Ultimately, it empowers organisations to make informed decisions that enhance efficiency and maximise outcomes.
4.2 The Data Analysis Process
Although the process of analysing data doesn’t follow a strict set of steps, it’s useful to consider the key stages that actuaries typically use when collecting and analysing data.
- Specify the Problem
  - Develop a well-defined set of objectives which need to be met by the results of the data analysis.
  - Identify the data items required for the analysis.
  - Collect the data from appropriate sources.
- Develop the Solution
  - Processing and formatting data for analysis, eg inputting into a spreadsheet, database or other model.
  - Cleaning data, eg addressing unusual, missing or inconsistent values (a short sketch follows this list).
  - Exploratory data analysis, which may include:
    - Descriptive analysis: producing summary statistics on central tendency and spread of the data.
    - Inferential analysis: estimating summary parameters of the wider population of data, testing hypotheses.
    - Predictive analysis: analysing data to make predictions about future events or other data sets.
  - Modelling the data.
  - Communicating the results.
- Monitoring the Experience
  - Monitoring the process; updating the data and repeating the process if required.
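The cleaning step above lends itself to a short sketch. The column names, the use of -1 as a missing-value code and the fixes applied are all assumptions for illustration; real cleaning rules depend on the dataset:

```python
import pandas as pd

# Hypothetical extract of an HR dataset with typical quality problems
raw = pd.DataFrame({
    "employee_id": [101, 102, 102, 103, 104],
    "department": ["Sales", "sales", "sales", "Finance", None],
    "tenure_years": [3.5, -1.0, -1.0, 12.0, 4.2],  # -1.0 used as a missing-value code
})

clean = (
    raw
    .drop_duplicates(subset="employee_id")                 # inconsistent: repeated records
    .assign(
        department=lambda d: d["department"].str.title(),  # inconsistent casing
        tenure_years=lambda d: d["tenure_years"].where(    # unusual: negative tenure
            d["tenure_years"] >= 0
        ),
    )
)

# Missing values are now explicit (NaN) and can be imputed or excluded deliberately
print(clean)
```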
Throughout the process, the team needs to ensure that any relevant professional guidance, such as IFRS or SHRM standards, has been complied with. Legal requirements, such as GDPR, also need to be met.
4.3 Data Sources and Lineage
In many companies, HR data was an afterthought and, unfortunately, many legacy systems and processes followed suit. In order to do a quality data analysis, the analyst needs to have quality data.
In all situations, knowledge of the details of the collection process is important for a complete understanding of the data, including possible sources of bias or inaccuracy. Issues that the analyst should be aware of include:
- whether the process was manual or automated
- limitations on the precision of the data recorded
- whether there was any validation at source
- how the data was converted to an electronic form if it wasn’t collected automatically
These factors can affect the accuracy and reliability of the data collected. For example:
- whether responses were collected on handwritten forms and then manually input into a system or database
- whether users inputting values into the system followed a clear and well-defined process
Ultimately, it is important to track and visualise the flow of data as it moves through an organisation’s systems, from its origin to its final destination; this is known as data lineage. Data lineage provides a comprehensive view of the data lifecycle, detailing where the data comes from, how it is transformed, and where it is stored or used.
Understanding data lineage is essential for several reasons:
- Data Quality: It helps identify potential issues in data quality by showing how data is transformed at each stage, making it easier to spot inconsistencies or errors.
- Compliance and Governance: Many regulations require organisations to maintain clear records of data usage and transformations. Data lineage supports compliance efforts by providing transparency.
- Impact Analysis: When changes are made to data systems or processes, data lineage allows organisations to assess the potential impact of those changes on downstream data and analytics.
- Troubleshooting: If problems arise, data lineage can help trace back to the source of the issue, facilitating faster resolution.
- Data Integration: It aids in understanding how different data sources interact, supporting more effective integration efforts.
When designing any analysis, it’s crucial to prioritise issues related to data security, privacy, and compliance with relevant regulations. It’s particularly important to recognise that merging data from various ‘anonymised’ sources can sometimes lead to the identification of individual cases.
4.4 Reproducible Research
Reproducible research refers to the practice of ensuring that the results of a study or analysis can be consistently replicated by other researchers using the same data, methods, and procedures. It involves providing enough detail about the research process so that others can follow the same steps and achieve similar outcomes. Reproducibility is a fundamental principle of scientific inquiry, as it enhances the credibility and reliability of research findings.
Doing things ‘by hand’ is very likely to create problems in reproducing the work. Examples of doing things by hand are:
- manually editing spreadsheets (rather than reading the raw data into a programming environment and making the changes there)
- editing tables and figures (rather than ensuring that the programming environment creates them exactly as needed)
- downloading data manually from a website (rather than doing it programmatically)
- pointing and clicking (unless the software used creates an audit trail of what has been clicked)
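For instance, rather than hand-editing a spreadsheet, the raw export can be read and transformed in a short script so that every change is recorded and repeatable. The file names and the column referenced below are hypothetical:

```python
import pandas as pd

# Read the raw export as-is; the original file is never edited by hand
raw = pd.read_csv("survey_export_raw.csv")  # hypothetical file name

# Every transformation lives in code, where it is visible, reviewable and repeatable
processed = (
    raw
    .rename(columns=str.lower)              # standardise column names
    .dropna(subset=["response"])            # the exclusion rule is documented explicitly
)

# Write a derived file rather than overwriting the raw data
processed.to_csv("survey_export_clean.csv", index=False)
```

Because the raw file is never modified, anyone can rerun the script on the original export and obtain exactly the same cleaned file.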
4.4.1 Elements Required for Reproducible Research
To achieve reproducible research, several key elements should be in place:
- Clear Documentation: Detailed documentation of the research methodology, including data collection methods, analytical techniques, and any assumptions made during the study.
- Access to Data: Providing access to the raw data used in the research, along with any relevant metadata that explains the data’s context and structure.
- Code and Software: Sharing the code and software used for analysis, ensuring that it is well-commented and includes instructions on how to run it.
- Version Control: Using version control systems (like Git) to track changes in code and documentation, making it easier for others to follow the research’s evolution.
- Standardised Formats: Employing standard formats for data and code to facilitate understanding and interoperability.
- Open Access: Making research findings, data, and code openly accessible, when possible, to encourage collaboration and verification by others.
- Comprehensive Reporting: Providing a thorough report that includes all relevant information, such as the context of the research, limitations, and potential biases.
4.4.2 Value of Reproducibility
Many PA analyses are conducted for commercial rather than scientific reasons and often remain unpublished. However, reproducibility still plays a crucial role for several reasons:
- Trust and Credibility: When research can be reproduced, it enhances trust in the findings and credibility in the scientific community.
- Validation of Results: Reproducibility allows other researchers to validate results, helping to confirm or challenge initial findings.
- Collaboration: Open and reproducible research fosters collaboration among researchers, enabling them to build on each other’s work and accelerate scientific progress.
- Technical Work Review: Reproducibility is essential for a thorough technical review, which is often a professional requirement. It ensures that the analysis has been conducted correctly and that the conclusions drawn are well-supported by the data.
- Regulatory and Audit Compliance: External regulators and auditors may require evidence of reproducibility to ensure that analyses meet industry standards and regulations.
- Extensibility: Reproducible research can be more easily expanded to explore the impacts of changes in the analysis or to integrate new data.
- Comparison with Past Investigations: When comparing results with similar investigations conducted in the past, having a reproducible record allows for confident analysis of any differences, provided that the earlier work was also reported reproducibly.
- Reduced Errors and Increased Efficiency: The discipline associated with reproducible research, which emphasises thorough documentation of processes and careful data storage, can lead to fewer errors that require correction, ultimately enhancing efficiency.
There are some limitations to what reproducibility addresses:
- Not a Guarantee of Correctness: Reproducibility does not ensure that the analysis is correct. For example, if an incorrect distribution is assumed, the results may still be wrong, even if they can be reproduced under the same flawed assumption. Nonetheless, clear documentation allows for transparency, enabling others to challenge any incorrect analyses.
- Timing of Reproducibility Activities: If reproducibility efforts are only undertaken at the end of an analysis, it may be too late to address any resulting challenges. Resources may have already been allocated to other projects, making it difficult to revisit the analysis when needed.
4.5 Models and Modelling
A model is an imitation of a real-world system or process. As per the classic aphorism attributed to George Box, “All models are wrong but some are useful.” Models can be built to simulate many real-world activities such as:
- Financial institutions can develop fraud detection models to identify potentially fraudulent transactions in real time.
- Organisations can develop models to predict which employees are at risk of leaving by analysing historical data on employee turnover.
- In manufacturing, predictive maintenance models can be developed to forecast equipment failures before they occur.
Care needs to be taken when building models. It is important to understand the variables, data and algorithms used in a model and their potential impacts. For example, a model could be built to automate the recruitment process by analysing data from past hiring decisions. Such a model would not be useful in many settings, as it may simply repeat the errors and biases of past decisions, and its use may even be illegal.
Where models can be useful is in giving us insights into what “might happen” if we take a course of action. In many cases, it may be too risky, slow or expensive to implement a process or system change or to try it on a control group. Being able to model the outcomes can help understand the impacts of decisions made. For example, modelling the impact of relocating offices on attrition before deciding to move.
4.5.1 Parameters
A model allows for the exploration of potential outcomes and consequences. By examining the impact of varying specific input parameters, organisations can analyse different scenarios before making decisions to implement their plans in the real world. This approach helps in understanding the risks and benefits associated with various options, ultimately leading to more informed and strategic decision-making.
To create a model of a system or process, it’s essential to establish a set of mathematical or logical assumptions regarding its functioning. The complexity of the model is influenced by the intricacy of the relationships between the different parameters involved.
For instance, when modelling a life insurance company, several factors must be taken into account, including regulations, taxation, and cancellation terms. Additionally, future events that could impact investment returns, inflation, new business opportunities, lapses, mortality rates, and expenses also play a crucial role in shaping these relationships.
4.5.2 Data
The quality of a model is limited by the quality of the data used to parameterise and validate it. The considerations described earlier around data collection, cleaning and lineage apply just as much here: the data needs to be relevant, credible and well understood, including any sources of bias or inaccuracy introduced during collection. In the life insurance example above, this would include the insurer’s past experience of lapses, mortality, expenses and new business. If the data feeding the model is poor or lacks credibility, its output is likely to be flawed, however sophisticated the model itself may be.
4.5.3 Objectives
Before finalising the selection of the model and its parameters, it’s crucial to consider the objectives behind creating and using the model. For instance, in many situations, the goal may not be to develop the most precise model but rather to construct one that avoids underestimating costs or other associated risks. Understanding these objectives helps ensure that the model effectively meets the needs of its intended application.
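For example, one simple way to reflect an objective of not underestimating costs, offered here as an illustrative assumption rather than a prescription, is to plan on a high percentile of the simulated outcomes rather than on their mean:

```python
import random
import statistics

# Hypothetical simulated annual costs from 1,000 model runs (illustrative distribution)
random.seed(1)
simulated_costs = [random.gauss(1.1e6, 0.15e6) for _ in range(1000)]

mean_cost = statistics.mean(simulated_costs)

# Planning on the 95th percentile deliberately errs towards over- rather than
# under-estimating the cost, in line with the stated objective
p95_cost = statistics.quantiles(simulated_costs, n=20)[-1]

print(f"mean={mean_cost:,.0f}, 95th percentile (planning figure)={p95_cost:,.0f}")
```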
4.6 Model Usage
Although the modelling process doesn’t follow a strict sequence of prescribed steps in practice, it can be helpful to outline a set of key stages for introductory purposes. In reality, actuaries who create and utilise models often move back and forth between these steps continuously, refining and enhancing the model as they go. This iterative approach allows for ongoing improvements and adjustments based on new insights and data.
The key steps in the modelling process can be outlined as follows:
Step 1. Define Clear Objectives: Establish a well-defined set of goals that the modelling process aims to achieve. For example, in modelling insurance claims, aside from predicting the number of claims of various sizes, one objective might be to accurately forecast 95% of the claims.
Step 2. Plan the Modelling Process: Outline the modelling approach and determine how the model will be validated. Validation will involve a series of diagnostic tests to ensure that the model meets the defined objectives.
Step 3. Collect and Analyse Data: Gather the necessary data required for the model and conduct an analysis to ensure its relevance and reliability.
Step 4. Define Model Parameters: Specify the parameters for the model and consider appropriate values for these parameters.
Step 5. Initial Model Definition: Create an initial version of the model that captures the essential features of the real-world system. The level of detail can be refined in later stages.
Step 6. Consult Experts: Engage with subject matter experts related to the real-world system being modelled to gather feedback on the validity of the conceptual model.
Step 7. Choose Implementation Tools: Decide whether to use a simulation package or a general-purpose programming language for the model’s implementation. Select a statistically reliable random number generator that suits the model’s complexity. Note that deterministic models do not require this step.
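For a stochastic model implemented in Python, one possible choice, shown here as an illustration rather than a recommendation of this text, is NumPy’s default generator, seeded so that runs can be reproduced exactly:

```python
import numpy as np

# A fixed seed makes every run of the stochastic model reproducible
rng = np.random.default_rng(seed=20240101)

# Example draws a model might need: monthly leaver counts and noise around an assumption
leavers = rng.binomial(n=500, p=0.015, size=12)  # 500 employees, 1.5% monthly leave rate
noise = rng.normal(loc=0.0, scale=0.02, size=12)

print(leavers, noise.round(3))
```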
Step 8. Write the Program: Develop the computer program for the model, allowing it to be executed once this stage is complete.
Step 9. Debug the Program: Test and debug the program to ensure it performs the intended operations as defined in the model.
Step 10. Test Output Reasonableness: Evaluate the reasonableness of the model’s output to ensure it aligns with expectations. Bear in mind that the stability of the relationships included in the model may not hold up over the long term; for instance, what appears to be exponential growth can seem linear if observed over a short timeframe. While it is possible to incorporate anticipated changes into the model, longer-term models are often unreliable. By their very nature, models are simplified representations of reality, so they may overlook ‘higher order’ relationships that seem insignificant in the short term but can accumulate and become important over the longer term. It’s essential to recognise these limitations when interpreting model outputs and making decisions based on them.
Step 11. Review Model Sensitivity: Carefully assess the model’s appropriateness by examining how small changes in the input parameters, or in their statistical distributions, affect the output. If minor adjustments lead to significant variations in the results, this indicates uncertainty in the predictions; the model can still be useful for generating a range of possible outputs by assuming various input values. This approach is known as sensitivity testing. It helps identify the key inputs and relationships that require careful consideration when designing and using the model, ensuring that the most critical factors are appropriately addressed.
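A minimal sketch of sensitivity testing, using an invented attrition model and parameter values, is simply to rerun the model across a small grid of inputs and compare the outputs:

```python
# Toy model: projected headcount after 12 months given a monthly attrition rate
def projected_headcount(start=1000, monthly_attrition=0.015, monthly_hires=10, months=12):
    headcount = start
    for _ in range(months):
        headcount = headcount * (1 - monthly_attrition) + monthly_hires
    return headcount

baseline = projected_headcount()

# Vary one input at a time by a small amount and observe the effect on the output
for rate in (0.010, 0.015, 0.020, 0.025):
    result = projected_headcount(monthly_attrition=rate)
    change = (result - baseline) / baseline
    print(f"attrition {rate:.1%}: headcount {result:.0f} ({change:+.1%} vs baseline)")
```

If a plausible change in one assumption moves the output by more than the decisions it supports can tolerate, that input deserves the closest attention.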
Step 12. Analyse Output: Conduct a thorough analysis of the output generated by the model to derive insights. If there is a real-world system available for comparison, a ‘Turing’ test can be employed. In this test, experts familiar with the real-world system are asked to compare several sets of data from both the real world and the model without knowing which is which. If these experts can successfully distinguish between the real-world data and the model output, their methods for doing so can provide valuable insights for enhancing the model.
Step 13. Ensure Compliance with Professional Guidance: Verify that any relevant professional guidelines, such as those from the Society for People Analytics (SPA) or the Society for Human Resource Management (SHRM), have been followed.
Step 14. Communicate and Document Results: After developing the model, document the results and the model itself, and communicate them in a way that is tailored to the knowledge level and perspective of the target audience. A primary concern is ensuring that the client recognises the model as a valid and valuable tool for decision-making; it is also essential that any limitations regarding the model’s use and validity are fully understood and appreciated. If the model will be used over time, it should be periodically reviewed to ensure it continues to meet user objectives, with opportunities for improvement explored by revisiting Steps 3 to 12 based on new data obtained since the original predictions were made.
4.7 Advantages and Disadvantages of Modelling
In PA, one of the key benefits of modelling is that complex systems can be simplified to the core components that underpin the system. Other advantages include:
- Complex systems with stochastic elements, such as workforce planning in an organisation, cannot be effectively described using a purely mathematical or logical model that provides easily interpretable results. Simulation modelling offers a valuable way to study workforce behaviour. For example, in people analytics, a simulation can be used to model future headcount, taking into account factors like job tenure, attrition, and external market conditions, helping HR professionals understand how different factors influence business goals and make more informed workforce decisions (a short sketch of such a simulation follows this list).
- Different future actions can be compared to see which best suits the requirements or constraints of the business.
- In a model of a complex system, we typically have the ability to control the experimental conditions, which allows us to reduce the variance in the model’s output without affecting the mean values. This control helps in obtaining more consistent and reliable results from the model while maintaining the overall accuracy of the predictions, ensuring that the key insights remain valid even as variability is minimised.
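A small sketch of the headcount simulation mentioned above follows; every rate and count is invented, and repeating the run many times also illustrates why stochastic models need multiple independent runs, a point picked up in the disadvantages below:

```python
import random
import statistics

def simulate_headcount(months=12, start=1000, monthly_attrition=0.015,
                       monthly_hires=12, rng=None):
    """One stochastic run: each month, each employee leaves with a small probability."""
    rng = rng or random.Random()
    headcount = start
    for _ in range(months):
        leavers = sum(1 for _ in range(headcount) if rng.random() < monthly_attrition)
        headcount = headcount - leavers + monthly_hires
    return headcount

# A single run is only one possible outcome; repeat the simulation to see the spread
rng = random.Random(2024)  # seeded for reproducibility
runs = [simulate_headcount(rng=rng) for _ in range(500)]
spread = statistics.quantiles(runs, n=20)

print(f"mean end-of-year headcount: {statistics.mean(runs):.0f}")
print(f"5th to 95th percentile range: {spread[0]:.0f} to {spread[-1]:.0f}")
```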
However, models are not a one-size-fits-all solution for PA problems. They come with limitations that need to be recognised. It’s important to understand these drawbacks when interpreting the model’s output and when communicating the results to stakeholders. This ensures that the insights provided are realistic and that any uncertainties or assumptions are properly accounted for in decision-making. The disadvantages include:
- Model development demands a significant investment of time and expertise. The financial costs can be substantial, as the process involves validating the model’s assumptions, ensuring the accuracy of the computer code, verifying the reasonableness of the results, and translating those results into plain language that the target audience can easily understand. This thorough approach is necessary to ensure the model is both reliable and useful for its intended purpose.
- In a stochastic model, each run with a given set of inputs provides only an estimate of the model’s outputs due to the randomness involved. To thoroughly study the outputs for a specific set of inputs, multiple independent runs of the model are required. This repetition allows for a more accurate understanding of the potential range of outcomes and helps capture the variability inherent in stochastic processes.
- Models can appear impressive when run on a computer, which can create a false sense of confidence in their accuracy. However, if a model hasn’t undergone proper validity and verification tests, its polished output is no substitute for its ability to accurately represent the real-world system it aims to imitate. It’s essential to ensure that the model has been rigorously tested before relying on its results for decision-making.
- Models rely heavily on the data input. If the data quality is poor or lacks credibility, then the output from the model is likely to be flawed.
- Users of the model must fully understand how it works and the situations where it can be safely applied. There is a risk of treating the model as a ‘black box,’ assuming that all its results are valid without carefully considering whether the model is appropriate for the specific data inputs and the type of output expected. Proper understanding helps prevent misuse and ensures that the model’s limitations and applicability are taken into account.
- It is not possible to include all future events in a model. For example, a change in legislation could invalidate the results of a model.
4.8 Final Thoughts
It is tempting to want to flex your analytical muscles and show off the great skills you have developed, but producing a solution that solves the problem in a timely manner is more valuable. Early in my career, I was working on a project to help an energy provider become more stable and ensure maximum output of electricity. I spent a lot of time building quite an advanced model to forecast maintenance, electricity demand, staff availability, etc. After much work, we had a few recommendations that would improve operations at the plant, but many of the solutions would be costly.

One night, we accidentally stayed a bit too late at the plant and noticed that a conveyor belt was running without any coal on it. The coal mine that fuelled the plant was quite a distance away, and keeping the belt running created quite a high internal electricity demand. The next day we investigated and it turned out that, by switching the conveyor belt off during low-demand times, we could increase the energy output and efficiency of the plant by a significant percentage at no additional cost. That simple change was almost as valuable as everything else we had put together and was, by far, the cheapest option. Thereafter, the only ‘model’ needed was to forecast the future saving from switching the belt off, which was simply \(\frac{10}{24} \times \text{total belt usage}\). Following the ACC and truly understanding the problem is far more useful and powerful than even the most advanced models, in my experience.