
Recent quantitative research on determinants of health in high income countries: A scoping review

Vladimira Varbanova*, Philippe Beutels

Affiliation: Centre for Health Economics Research and Modelling Infectious Diseases, Vaccine and Infectious Disease Institute, University of Antwerp, Antwerp, Belgium

* E-mail: [email protected]

Author contributions: Vladimira Varbanova: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing. Philippe Beutels: Conceptualization, Data curation, Funding acquisition, Project administration, Resources, Supervision, Validation, Visualization, Writing – review & editing.

  • Published: September 17, 2020
  • https://doi.org/10.1371/journal.pone.0239031

Abstract

Identifying determinants of health and understanding their role in health production constitute an important research theme. We aimed to document the state of recent multi-country research on this theme in the literature.

We followed the PRISMA-ScR guidelines to systematically identify, triage, and review literature (January 2013 to July 2019). We searched for studies that performed cross-national statistical analyses aiming to evaluate the impact of one or more aggregate-level determinants on one or more general population health outcomes in high-income countries. To assess in which combinations and to what extent individual (or thematically linked) determinants had been studied together, we performed multidimensional scaling and cluster analysis.

Sixty studies were selected, out of an original yield of 3686. Life expectancy and overall mortality were the most widely used population health indicators, while determinants came from the areas of healthcare, culture, politics, socio-economics, environment, labor, fertility, demographics, life-style, and psychology. The family of regression models was the predominant statistical approach. Results from our multidimensional scaling showed that a relatively tight core of determinants has received much attention, as main covariates of interest or controls, whereas the majority of other determinants were studied in very limited contexts. We consider findings from these studies regarding the importance of any given health determinant inconclusive at present. Across a multitude of model specifications, different country samples, and varying time periods, effects fluctuated between statistically significant and not significant, and between beneficial and detrimental to health.

Conclusions

We conclude that efforts to understand the underlying mechanisms of population health are far from settled, and the present state of research on the topic leaves much to be desired. It is essential that future research considers multiple factors simultaneously and takes advantage of more sophisticated methodology, both in quantifying health and in analyzing the influence of determinants.

Citation: Varbanova V, Beutels P (2020) Recent quantitative research on determinants of health in high income countries: A scoping review. PLoS ONE 15(9): e0239031. https://doi.org/10.1371/journal.pone.0239031

Editor: Amir Radfar, University of Central Florida, UNITED STATES

Received: November 14, 2019; Accepted: August 28, 2020; Published: September 17, 2020

Copyright: © 2020 Varbanova, Beutels. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting Information files.

Funding: This study (and VV) is funded by the Research Foundation Flanders (https://www.fwo.be/en/), FWO project number G0D5917N, award obtained by PB. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Identifying the key drivers of population health is a core subject in public health and health economics research. Between-country comparative research on the topic is challenging. In order to be relevant for policy, it requires disentangling different interrelated drivers of “good health”, each having different degrees of importance in different contexts.

“Good health” (physical and psychological, subjective and objective) can be defined and measured using a variety of approaches, depending on which aspect of health is the focus. A major distinction can be made between health measured at the individual level and health measured at some aggregate level, such as a neighborhood, a region, or a country. In view of this, a great diversity of specific research topics exists on the drivers of individual or aggregate “good health”, including those focusing on health inequalities, the gender gap in longevity, and regional differences in mortality and longevity.

The current scoping review focuses on determinants of population health. Stated as such, this topic is quite broad. Indeed, we are interested in the very general question of what methods have been used to make the most of increasingly available region- or country-specific databases to understand the drivers of population health through inter-country comparisons. Existing reviews indicate that researchers thus far tend to adopt a narrower focus. Usually, attention is given to only one health outcome at a time, with further geographical and/or population [1, 2] restrictions. In some cases, the impact of one or more interventions is at the core of the review [3–7], while in others it is the relationship between health and just one particular predictor, e.g., income inequality, access to healthcare, government mechanisms [8–13]. Some relatively recent reviews on the subject of social determinants of health [4–6, 14–17] have considered a number of indicators potentially influencing health as opposed to a single one. One review defines “social determinants” as “the social, economic, and political conditions that influence the health of individuals and populations” [17], while another refers even more broadly to “the factors apart from medical care” [15].

In the present work, we aimed to be more inclusive, setting no limitations on the nature of possible health correlates, as well as making use of a multitude of commonly accepted measures of general population health. The goal of this scoping review was to document the state of the art in the recent published literature on determinants of population health, with a particular focus on the types of determinants selected and the methodology used. In doing so, we also report the main characteristics of the results these studies found. The materials collected in this review are intended to inform our (and potentially other researchers’) future analyses on this topic. Since the production of health is subject to the law of diminishing marginal returns, we focused our review on those studies that included countries where a high standard of wealth has been achieved for some time, i.e., high-income countries belonging to the Organisation for Economic Co-operation and Development (OECD) or Europe. Adding similar reviews for other country income groups is of limited interest to the research we plan to do in this area.

Methods

Given this review's focus on data and methods rather than results, a formal protocol was not registered prior to undertaking it, but the procedure followed the guidelines of the PRISMA statement for scoping reviews [18].

Search strategy

We focused on multi-country studies investigating the potential associations between any aggregate-level (region/city/country) determinant and general measures of population health (e.g., life expectancy, mortality rate).

Within the query itself, we listed well-established population health indicators as well as the six world regions defined by the World Health Organization (WHO). We searched only in the publications’ titles in order to keep the number of hits manageable and the ratio of broadly relevant abstracts to all abstracts on the order of 10% (based on a series of time-focused trial runs). The search strategy was developed iteratively between the two authors and is presented in S1 Appendix. The search was performed by VV in PubMed and Web of Science on the 16th of July 2019, without any language restrictions, and with a start date set to the 1st of January 2013, as we were interested in the latest developments in this area of research.

Eligibility criteria

Records obtained via the search methods described above were screened independently by the two authors. Consistency between inclusion/exclusion decisions was approximately 90%, and the 43 instances where uncertainty existed were resolved through discussion. Articles were included subject to meeting all of the following requirements:

  • (a) the paper was a full published report of an original empirical study investigating the impact of at least one aggregate-level (city/region/country) factor on at least one health indicator (or self-reported health) of the general population (the only admissible “sub-populations” were those based on gender and/or age);
  • (b) the study employed statistical techniques (calculating correlations, at the very least) and was not purely descriptive or theoretical in nature;
  • (c) the analysis involved at least two countries, or at least two regions or cities (or another aggregate level) in at least two different countries;
  • (d) the health outcome was not differentiated according to some socio-economic factor and thus studied in terms of inequality (with the exception of gender and age differentiations);
  • (e) mortality, in case it was one of the health indicators under investigation, was strictly “total” or “all-cause” (no cause-specific or determinant-attributable mortality).

Data extraction

The following pieces of information were extracted in an Excel table from the full text of each eligible study (primarily by VV, consulting with PB in case of doubt): health outcome(s), determinants, statistical methodology, level of analysis, results, type of data, data sources, time period, countries. The evidence is synthesized according to these extracted data (often directly reflected in the section headings), using a narrative form accompanied by a “summary-of-findings” table and a graph.

Results

Search and selection

The initial yield contained 4583 records, reduced to 3686 after removal of duplicates (Fig 1). Based on title and abstract screening, 3271 records were excluded because they focused on specific medical condition(s) or specific populations (based on morbidity or some other factor), dealt with intervention effectiveness, with theoretical or non-health related issues, or with animals or plants. Of the remaining 415 papers, roughly half were disqualified upon full-text consideration, mostly due to using an outcome not of interest to us (e.g., health inequality), measuring and analyzing determinants and outcomes exclusively at the individual level, performing analyses one country at a time, employing indices that are a mixture of both health indicators and health determinants, or not utilizing potential health determinants at all. After this second stage of the screening process, 202 papers were deemed eligible for inclusion. This group was further dichotomized according to the level of economic development of the countries or regions under study, using membership of the OECD or Europe as a reference “cut-off” point. Sixty papers were judged to include high-income countries, and the remaining 142 included either low- or middle-income countries or a mix of both these levels of development. The rest of this report outlines findings in relation to high-income countries only, reflecting our own primary research interests. Nonetheless, we chose to report our search yield for the other income groups for two reasons: first, to gauge the relative interest in applied published research for these different income levels; and second, to enable other researchers with a focus on determinants of health in other countries to use the extraction we made here.

Fig 1. Flow diagram of the study selection process.

https://doi.org/10.1371/journal.pone.0239031.g001

Health outcomes

The most frequent population health indicator, life expectancy (LE), was present in 24 of the 60 studies. Apart from “life expectancy at birth” (representing the average life-span a newborn is expected to have if current mortality rates remain constant), also called “period LE” by some [19, 20], we also encountered LE at 40 years of age [21], at 60 [22], and at 65 [21, 23, 24]. In two papers, the age-specificity of life expectancy (be it at birth or another age) was not stated [25, 26].

Some studies considered male and female LE separately [21, 24, 25, 27–33]. The same breakdown was often applied to the second most commonly used health index [28–30, 34–38], the “total” (also “overall” or “all-cause”) mortality rate (MR), included in 22 of the 60 studies. In addition to gender, this index was also sometimes broken down by age group [30, 39, 40], or by gender and age group combined [38].

While the majority of studies under review here focused on a single health indicator, 23 out of the 60 studies made use of multiple outcomes, although these outcomes were always considered one at a time, and sometimes not all of them fell within the scope of our review. An easily discernible group of indices that typically went together [25, 37, 41] was that of neonatal (deaths occurring within 28 days postpartum), perinatal (fetal deaths or deaths within the first 7 days of life), and post-neonatal (deaths between the 29th day and completion of one year of life) mortality. More often than not, these indices were accompanied by “stand-alone” indicators, such as infant mortality (deaths within the first year of life; our third most common index, found in 16 of the 60 studies), maternal mortality (deaths during pregnancy or within 42 days of termination of pregnancy), and child mortality rates. Child mortality has conventionally been defined as mortality within the first 5 years of life, and is thus often also called “under-5 mortality”. Nonetheless, Pritchard & Wallace used the term “child mortality” to denote deaths of children younger than 14 years [42].

As previously stated, inclusion criteria did allow for self-reported health status to be used as a general measure of population health. Within our final selection of studies, seven utilized some form of subjective health as an outcome variable [25, 43–48]. Additionally, the Health Human Development Index [49], healthy life expectancy [50], old-age survival [51], potential years of life lost [52], and disability-adjusted life expectancy [25] were also used.

We note that while in most cases the indicators mentioned above (and/or the covariates considered, see below) were taken in their absolute or logarithmic form, as a (typically annual) number, sometimes they were used in the form of differences, change rates, averages over a given time period, or even z-scores of rankings [19, 22, 40, 42, 44, 53–57].
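To make these kinds of transformations concrete, here is a minimal pandas sketch on a made-up two-country panel. All column names and numbers are hypothetical; it simply shows differences, change rates, and within-year z-scores of the sort the cited studies used.

```python
import pandas as pd

# Hypothetical two-country panel; "le" stands in for any health indicator.
df = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year": [2000, 2001, 2002, 2003] * 2,
    "le": [76.1, 76.4, 76.6, 76.9, 80.2, 80.3, 80.5, 80.6],
})

g = df.sort_values("year").groupby("country")["le"]
df["le_diff"] = g.diff()        # year-on-year differences
df["le_rate"] = g.pct_change()  # change rates
# z-scores across countries within each year (cf. z-scores of rankings)
df["le_z"] = df.groupby("year")["le"].transform(
    lambda s: (s - s.mean()) / s.std())
```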

Regions, countries, and populations

Despite our decision to confine this review to high-income countries, some variation in the countries and regions studied was still present. Selection seemed to be most often conditioned on the European Union, or the European continent more generally, and the OECD, though, typically, not all member nations (based on the instances where these were explicitly listed) were included in a given study. Some of the stated reasons for omitting certain nations included data unavailability [30, 45, 54] or inconsistency [20, 58], too low a Gross Domestic Product (GDP) [40], differences in economic development and political stability relative to the rest of the sampled countries [59], and too small a national population [24, 40]. On the other hand, the rationales for selecting a group of countries included having similar above-average infant mortality [60], similar healthcare systems [23], and being randomly drawn from a social spending category [61]. Some researchers were interested explicitly in a specific geographical region, such as Eastern Europe [50], Central and Eastern Europe [48, 60], the Visegrad (V4) group [62], or the Asia/Pacific area [32]. In certain instances, national regions or cities, rather than countries, constituted the units of investigation [31, 51, 56, 62–66]. In two particular cases, a mix of countries and cities was used [35, 57]. In another two [28, 29], due to the long time periods under study, some of the included countries no longer exist. Finally, besides “European” and “OECD”, the terms “developed”, “Western”, and “industrialized” were also used to describe the group of selected nations [30, 42, 52, 53, 67].

As stated above, it was the health status of the general population that we were interested in, and during screening we made a concerted effort to exclude research using data based on a more narrowly defined group of individuals. All studies included in this review adhere to this general rule, albeit with two caveats. First, as cities (even neighborhoods) were the unit of analysis in three of the selected studies [56, 64, 65], the populations under investigation there can be more accurately described as general urban, rather than simply general. Second, health indicators were oftentimes stratified by gender and/or age, and we therefore also admitted one study that, due to its specific research question, focused on men and women of early retirement age [35] and another that considered adult males only [68].

Data types and sources

A great diversity of sources was utilized for data collection purposes. The accessible reference databases of the OECD (https://www.oecd.org/), WHO (https://www.who.int/), World Bank (https://www.worldbank.org/), United Nations (https://www.un.org/en/), and Eurostat (https://ec.europa.eu/eurostat) were among the top choices. Other international databases included Human Mortality [30, 39, 50], Transparency International [40, 48, 50], Quality of Government [28, 69], World Income Inequality [30], the International Labor Organization [41], and the International Monetary Fund [70]. A number of national databases were referred to as well, for example the US Bureau of Statistics [42, 53], Korean Statistical Information Services [67], Statistics Canada [67], the Australian Bureau of Statistics [67], and Health New Zealand Tobacco control and Health New Zealand Food and Nutrition [19]. Well-known surveys, such as the World Values Survey [25, 55], the European Social Survey [25, 39, 44], the Eurobarometer [46, 56], the European Value Survey [25], and the European Statistics of Income and Living Condition Survey [43, 47, 70], were used as data sources, too. Finally, in some cases [25, 28, 29, 35, 36, 41, 69], built-for-purpose datasets from previous studies were re-used.

In most of the studies, the level of the data (and analysis) was national. The exceptions were six papers that dealt with Nomenclature of Territorial Units for Statistics (NUTS2) regions [31, 62, 63, 66], otherwise defined areas [51], or cities [56], and seven others that used multilevel designs, combining country- and region-level data [57], individual- and city- or country-level data [35], individual- and country-level data [44, 45, 48], individual- and neighborhood-level data [64], and city-region- (NUTS3) and country-level data [65]. Parallel to that, the data type was predominantly longitudinal, with only a few studies using purely cross-sectional data [25, 33, 43, 45–48, 50, 62, 67, 68, 71, 72], albeit in four of those [43, 48, 68, 72] two separate points in time were taken (thus resulting in a kind of “double cross-section”), while in another the averages across survey waves were used [56].

In studies using longitudinal data, the length of the covered time periods varied greatly. Although this was almost always less than 40 years, in one study it covered the entire 20th century [29]. Longitudinal data, typically in the form of annual records, were sometimes transformed before usage. For example, some researchers considered data points at 5-year [34, 36, 49] or 10-year [27, 29, 35] intervals instead of the traditional annual ones, or took averages over 3-year periods [42, 53, 73]. In one study concerned with the effect of the Great Recession, all data were in a “recession minus expansion change in trends” form [57]. Furthermore, there were a few instances where two different time periods were compared to each other [42, 53] or where data were divided into 2 to 4 (possibly overlapping) periods which were then analyzed separately [24, 26, 28, 29, 31, 65]. Lastly, owing to data availability issues, discrepancies between the time points or periods of data on the different variables were occasionally observed [22, 35, 42, 53–55, 63].
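A small pandas sketch of the two aggregation schemes just mentioned, 5-year sampling and non-overlapping 3-year averaging, on invented numbers (all names hypothetical):

```python
import pandas as pd

# Made-up annual mortality series for one country, 2000-2011.
df = pd.DataFrame({"year": list(range(2000, 2012)),
                   "mr": [9.1, 9.0, 8.9, 8.7, 8.8, 8.6,
                          8.5, 8.4, 8.4, 8.2, 8.1, 8.0]})

# Data points at 5-year intervals instead of annual records.
five_yearly = df[df["year"] % 5 == 0]

# Averages over non-overlapping 3-year periods (2000-02, 2003-05, ...).
df["period"] = (df["year"] - 2000) // 3
three_year_means = df.groupby("period")["mr"].mean()
```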

Health determinants

Together with other essential details, Table 1 lists the health correlates considered in the selected studies. Several general categories for these correlates can be discerned, including healthcare, political stability, socio-economics, demographics, psychology, environment, fertility, life-style, culture, and labor. All of these, directly or implicitly, have been recognized as holding importance for population health by existing theoretical models of (social) determinants of health [74–77].

Table 1.

https://doi.org/10.1371/journal.pone.0239031.t001

It is worth noting that in a few studies there was just a single aggregate-level covariate investigated in relation to a health outcome of interest to us. In one instance this was life satisfaction [44]; in others, welfare system typology [45], gender inequality [33], austerity level [70, 78], and deprivation [51]. Most often, though, such exclusive attention went to GDP [27, 29, 46, 57, 65, 71]. It was often the case that research had a more particular focus: among others, minimum wages [79], hospital payment schemes [23], cigarette prices [63], social expenditure [20], residents’ dissatisfaction [56], income inequality [30, 69], and work leave [41, 58] took center stage. Whenever variables outside of these specific areas were also included, they were usually treated as confounders or controls, moderators, or mediators.

We visualized the combinations in which the different determinants have been studied in Fig 2, which was obtained via multidimensional scaling and a subsequent cluster analysis (details outlined in S2 Appendix). It depicts the spatial positioning of each determinant relative to all others, based on the number of times the effects of each pair of determinants have been studied simultaneously. When interpreting Fig 2, one should keep in mind that determinants marked with an asterisk represent, in fact, collectives of variables.

Fig 2.

Groups of determinants are marked by asterisks (see S1 Table in S1 Appendix). Diminishing color intensity reflects a decrease in the total number of “connections” for a given determinant. Noteworthy pairwise “connections” are emphasized via lines (solid-dashed-dotted indicates decreasing frequency). Grey contour lines encircle groups of variables that were identified via cluster analysis. Abbreviations: age = population age distribution, associations = membership in associations, AT-index = atherogenic-thrombogenic index, BR = birth rate, CAPB = Cyclically Adjusted Primary Balance, civilian-labor = civilian labor force, C-section = Cesarean delivery rate, credit-info = depth of credit information, dissatisf = residents’ dissatisfaction, distrib.orient = distributional orientation, EDU = education, eHealth = eHealth index at GP-level, exch.rate = exchange rate, fat = fat consumption, GDP = gross domestic product, GFCF = Gross Fixed Capital Formation/Creation, GH-gas = greenhouse gas, GII = gender inequality index, gov = governance index, gov.revenue = government revenues, HC-coverage = healthcare coverage, HE = health(care) expenditure, HHconsump = household consumption, hosp.beds = hospital beds, hosp.payment = hospital payment scheme, hosp.stay = length of hospital stay, IDI = ICT development index, inc.ineq = income inequality, industry-labor = industrial labor force, infant-sex = infant sex ratio, labor-product = labor productivity, LBW = low birth weight, leave = work leave, life-satisf = life satisfaction, M-age = maternal age, marginal-tax = marginal tax rate, MDs = physicians, mult.preg = multiple pregnancy, NHS = National Health System, NO = nitrous oxide emissions, PM10 = particulate matter (PM10) emissions, pop = population size, pop.density = population density, pre-term = pre-term birth rate, prison = prison population, researchE = research & development expenditure, school.ref = compulsory schooling reform, smoke-free = smoke-free places, SO = sulfur oxide emissions, soc.E = social expenditure, soc.workers = social workers, sugar = sugar consumption, terror = terrorism, union = union density, UR = unemployment rate, urban = urbanization, veg-fr = vegetable-and-fruit consumption, welfare = welfare regime, Wwater = wastewater treatment.

https://doi.org/10.1371/journal.pone.0239031.g002

Distances between determinants in Fig 2 are indicative of determinants’ “connectedness” with each other. While the statistical procedure called for higher dimensionality of the model, for demonstration purposes we show here a two-dimensional solution. This simplification unfortunately comes with a caveat. Taking the factor smoking as an example, it would appear to stand at a much greater distance from GDP than from alcohol. In reality, however, smoking was considered together with alcohol consumption [21, 25, 26, 52, 68] in just as many studies as it was with GDP [21, 25, 26, 52, 59], namely five. To compensate for this apparent shortcoming, we have emphasized the strongest pairwise links. Solid lines connect GDP with health expenditure (HE), unemployment rate (UR), and education (EDU), indicating that the effect of GDP on health, taking into account the effects of the other three determinants as well, was evaluated in 12 to 16 of the 60 studies included in this review. Tracing the dashed lines, we can also tell that GDP appeared jointly with income inequality, and HE together with either EDU or UR, in 8 to 10 of our selected studies. Finally, some weaker but still noteworthy “connections” between variables are displayed via the dotted lines.

The fact that all notable pairwise “connections” are concentrated within a relatively small region of the plot may be interpreted as low overall “connectedness” among the determinants studied. GDP is the most widely investigated determinant in relation to general population health. Its total number of “connections” is disproportionately high (159) compared to its runner-up, HE (113 “connections”), followed by EDU (90) and UR (86). In fact, all of these determinants could be thought of as outliers, given that none of the remaining factors has a total count of pairings above 52. This decrease in individual determinants’ overall “connectedness” can be tracked on the graph via the change of color intensity as we move outwards from the symbolic center of GDP and its closest “co-determinants”, finally reaching the ten indicators (welfare regime, household consumption, compulsory school reform, life satisfaction, government revenues, literacy, research expenditure, multiple pregnancy, Cyclically Adjusted Primary Balance, and residents’ dissatisfaction; in white) whose effects on health were only studied in isolation.

Lastly, we point to the few small but stable clusters of covariates encircled by the grey contour lines in Fig 2. These groups of determinants were identified as “close” by both statistical procedures used for the production of the graph (see details in S2 Appendix).
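The authors' exact procedure is detailed in S2 Appendix; purely to illustrate the general technique (metric MDS plus hierarchical clustering on a co-occurrence matrix), here is a sketch with invented counts. The labels echo Fig 2, but every number is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

# Invented symmetric co-study counts: entry (i, j) = number of studies in
# which determinants i and j were analyzed together.
labels = ["GDP", "HE", "EDU", "UR", "smoking", "alcohol"]
counts = np.array([
    [0, 14, 13, 12, 5, 4],
    [14, 0, 9, 8, 3, 2],
    [13, 9, 0, 6, 2, 2],
    [12, 8, 6, 0, 2, 1],
    [5, 3, 2, 2, 0, 5],
    [4, 2, 2, 1, 5, 0],
])

# Turn co-occurrence (a similarity) into a dissimilarity with zero diagonal.
dissim = (counts.max() - counts).astype(float)
np.fill_diagonal(dissim, 0.0)

# Two-dimensional metric MDS on the precomputed dissimilarities.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)

# Hierarchical (average-linkage) clustering on the same dissimilarities.
clusters = fcluster(linkage(squareform(dissim), method="average"),
                    t=2, criterion="maxclust")
```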

Statistical methodology

There was great variation in the level of statistical detail reported. Some authors provided too vague a description of their analytical approach, necessitating some inference in this section.

The issue of missing data is a challenging reality in this field of research, but few of the studies under review (12/60) explain how they dealt with it. Among the ones that do, three general approaches to handling missingness can be identified, listed in increasing level of sophistication: case-wise deletion, i.e., removal of countries from the sample [20, 45, 48, 58, 59], (linear) interpolation [28, 30, 34, 58, 59, 63], and multiple imputation [26, 41, 52].
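A compact sketch of the three approaches on a toy panel follows; all column names are hypothetical, and the last step uses scikit-learn's IterativeImputer for a single chained-equations pass, a stand-in for (rather than a full implementation of) multiple imputation.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Made-up panel with gaps in health expenditure ("he").
df = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5,
    "year": list(range(2000, 2005)) * 2,
    "he": [5.1, np.nan, 5.4, 5.6, np.nan, 7.0, 7.1, np.nan, 7.5, 7.6],
})

# (1) Case-wise deletion: keep only countries with complete series.
complete = df.groupby("country").filter(lambda g: g["he"].notna().all())

# (2) Linear interpolation within each country's time series.
df["he_interp"] = df.groupby("country")["he"].transform(
    lambda s: s.interpolate(limit_direction="both"))

# (3) Model-based imputation: one pass here; proper multiple imputation
# would repeat with several seeds and pool the resulting estimates.
df["he_imputed"] = IterativeImputer(random_state=0).fit_transform(
    df[["year", "he"]])[:, 1]
```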

Correlations, whether Pearson, Spearman, or unspecified, were the only technique applied with respect to the health outcomes of interest in eight analyses [33, 42–44, 46, 53, 57, 61]. Among the more advanced statistical methods, the family of regression models proved to be, by and large, predominant. Before examining this more closely, we note the techniques that were, in a way, “unique” within this selection of studies: meta-analyses (random-effects and fixed-effects, respectively) were performed on the reduced-form and two-sample two-stage least squares (2SLS) estimations done within countries [39]; difference-in-difference (DiD) analysis was applied in one case [23]; dynamic time-series methods, among which co-integration, impulse-response function (IRF), and panel vector autoregressive (VAR) modeling, were utilized in one study [80]; longitudinal generalized estimating equation (GEE) models were developed on two occasions [70, 78]; hierarchical Bayesian spatial models [51] and spatial autoregressive regression [62] were also implemented.
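As one concrete instance from this toolbox, a minimal GEE sketch in statsmodels on synthetic data; every variable name is hypothetical, and the exchangeable working correlation is our assumption for illustration, not necessarily what the cited studies specified.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic long-format panel: 5 countries x 10 years.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "country": np.repeat(list("ABCDE"), 10),
    "year": np.tile(np.arange(2000, 2010), 5),
})
df["gdp"] = rng.normal(0, 1, len(df))
df["austerity"] = rng.normal(0, 1, len(df))
df["mr"] = (9 - 0.3 * df["gdp"] + 0.2 * df["austerity"]
            + rng.normal(0, 0.5, len(df)))

# GEE with countries as clusters, echoing the longitudinal GEE models above.
gee = smf.gee("mr ~ gdp + austerity", groups="country", data=df,
              cov_struct=sm.cov_struct.Exchangeable(),
              family=sm.families.Gaussian())
print(gee.fit().summary())
```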

Purely cross-sectional data analyses were performed in eight studies [25, 45, 47, 50, 55, 56, 67, 71]. These consisted of linear regression (assumed ordinary least squares (OLS)), generalized least squares (GLS) regression, and multilevel analyses. However, six other studies that used longitudinal data in fact had a cross-sectional design, applying regression at multiple time points separately [27, 29, 36, 48, 68, 72].

Apart from these “multi-point cross-sectional studies”, some other simplistic approaches to longitudinal data analysis were found, involving calculating and regressing 3-year averages of both the response and the predictor variables [54], taking the average of a few data points (i.e., survey waves) [56], or using difference scores over 10-year [19, 29] or unspecified time intervals [40, 55].

Moving further in the direction of more sensible longitudinal data usage, we turn to the methods widely known among (health) economists as “panel data analysis” or “panel regression”. Most often seen were models with fixed effects for country/region, and sometimes also for time point (occasionally including a country-specific trend as well), with robust standard errors for the parameter estimates to take into account correlations among clustered observations [20, 21, 24, 28, 30, 32, 34, 37, 38, 41, 52, 59, 60, 63, 66, 69, 73, 79, 81, 82]. The Hausman test [83] was sometimes mentioned as the tool used to decide between fixed and random effects [26, 49, 63, 66, 73, 82]. A few studies considered the latter more appropriate for their particular analyses, with some further specifying that (feasible) GLS estimation was employed [26, 34, 49, 58, 60, 73]. Apart from these two types of models, the first-differences method was encountered once as well [31]. Across all of these, the error terms were sometimes assumed to follow a first-order autoregressive process (AR(1)), i.e., they were allowed to be serially correlated [20, 30, 38, 58–60, 73], and lags of (typically) predictor variables were included in the model specification, too [20, 21, 37, 38, 48, 69, 81]. Lastly, a somewhat different approach to longitudinal data analysis was undertaken in four studies [22, 35, 48, 65], in which multilevel linear or Poisson models were developed.
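A hedged sketch of this workhorse specification, two-way fixed effects with country-clustered standard errors, implemented via dummy variables in statsmodels on synthetic data (all variable names hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: 8 countries x 15 years.
rng = np.random.default_rng(1)
n_c, n_t = 8, 15
df = pd.DataFrame({
    "country": np.repeat([f"c{i}" for i in range(n_c)], n_t),
    "year": np.tile(np.arange(2000, 2000 + n_t), n_c),
})
df["gdp"] = rng.normal(0, 1, len(df))
df["he"] = rng.normal(0, 1, len(df))
df["le"] = 80 + 0.4 * df["gdp"] + 0.2 * df["he"] + rng.normal(0, 0.5, len(df))

# Two-way fixed effects via country and year dummies; standard errors
# clustered by country to allow for within-country correlation.
groups = pd.factorize(df["country"])[0]
fe = smf.ols("le ~ gdp + he + C(country) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": groups})
print(fe.params[["gdp", "he"]])
print(fe.bse[["gdp", "he"]])
```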

Regardless of the exact techniques used, most studies included in this review presented multiple model applications within their main analysis. None attempted to formally compare models in order to identify the “best”, even if goodness-of-fit statistics were occasionally reported. As indicated above, many studies investigated women’s and men’s health separately [19, 21, 22, 27–29, 31, 33, 35, 36, 38, 39, 45, 50, 51, 64, 65, 69, 82], and covariates were often tested one at a time, including other covariates only incrementally [20, 25, 28, 36, 40, 50, 55, 67, 73]. Furthermore, there were a few instances where analyses within countries were performed as well [32, 39, 51] or where the full time period of interest was divided into a few sub-periods [24, 26, 28, 31]. There were also cases where different statistical techniques were applied in parallel [29, 55, 60, 66, 69, 73, 82], sometimes as a form of sensitivity analysis [24, 26, 30, 58, 73]. However, the most common approach to sensitivity analysis was to re-run models with somewhat different samples [39, 50, 59, 67, 69, 80, 82]. Other strategies included different categorization of variables or adding (more/other) controls [21, 23, 25, 28, 37, 50, 63, 69], using an alternative main covariate measure [59, 82], including lags for predictors or outcomes [28, 30, 58, 63, 65, 79], using weights [24, 67] or alternative data sources [37, 69], or using non-imputed data [41].

As the methods and not the findings are the main focus of the current review, and because generic checklists cannot discern the underlying quality in this application field (see also below), we opted to pool all reported findings together, regardless of individual study characteristics or particular outcome(s) used, and speak generally of positive and negative effects on health. For this summary we adopted the 0.05 significance level and only considered results from multivariate analyses. Strictly birth-related factors are omitted, since these potentially relate only to the group of infant mortality indicators and not to any of the other general population health measures.
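For transparency about what such pooling involves, here is a toy version of the tally; every row is invented. A real extraction table would hold one row per reported multivariate coefficient, with a flag for whether a positive sign means better health (true for, e.g., life expectancy; false for mortality rates).

```python
import numpy as np
import pandas as pd

# Invented extraction rows: determinant, coefficient, p-value, and whether
# a positive coefficient is good for health given the outcome used.
effects = pd.DataFrame({
    "determinant": ["GDP", "GDP", "HE", "inc.ineq", "UR"],
    "beta": [0.21, 0.05, 0.12, -0.08, 0.30],
    "p": [0.001, 0.40, 0.03, 0.02, 0.60],
    "positive_is_good": [True, True, True, True, False],
})

better = np.where(effects["positive_is_good"],
                  effects["beta"] > 0, effects["beta"] < 0)
effects["verdict"] = np.where(effects["p"] >= 0.05, "not significant",
                              np.where(better, "beneficial", "detrimental"))
summary = (effects.groupby(["determinant", "verdict"])
           .size().unstack(fill_value=0))
print(summary)
```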

Starting with the determinants most often studied: higher GDP levels [21, 26, 27, 29, 30, 32, 43, 48, 52, 58, 60, 66, 67, 73, 79, 81, 82], higher health [21, 37, 47, 49, 52, 58, 59, 68, 72, 82] and social [20, 21, 26, 38, 79] expenditures, higher education [26, 39, 52, 62, 72, 73], lower unemployment [60, 61, 66], and lower income inequality [30, 42, 53, 55, 73] were found to be significantly associated with better population health on a number of occasions. In addition, there was some evidence that democracy [36] and freedom [50], higher work compensation [43, 79], distributional orientation [54], cigarette prices [63], gross national income [22, 72], labor productivity [26], exchange rates [32], marginal tax rates [79], vaccination rates [52], total fertility [59, 66], fruit and vegetable [68], fat [52] and sugar consumption [52], as well as greater depth of credit information [22] and percentage of civilian labor force [79], longer work leaves [41, 58], more physicians [37, 52, 72], nurses [72], and hospital beds [79, 82], and also membership in associations, perceived corruption, and societal trust [48] were beneficial to health. Higher nitrous oxide (NO) levels [52], longer average hospital stay [48], deprivation [51], dissatisfaction with healthcare and the social environment [56], corruption [40, 50], smoking [19, 26, 52, 68], alcohol consumption [26, 52, 68] and illegal drug use [68], poverty [64], higher percentage of industrial workers [26], Gross Fixed Capital creation [66] and older population [38, 66, 79], gender inequality [22], and fertility [26, 66] were detrimental.

It is important to point out that the above-mentioned effects could not be considered stable either across or within studies. Very often, the statistical significance of a given covariate fluctuated between the different model specifications tried out within the same study [20, 49, 59, 66, 68, 69, 73, 80, 82], testifying to the importance of control variables and of multivariate research (i.e., analyzing multiple independent variables simultaneously) in general. Furthermore, conflicting results were observed even for the “core” determinants given special attention throughout this text. Thus, some studies reported negative effects of health expenditure [32, 82], social expenditure [58], GDP [49, 66], and education [82], and positive effects of income inequality [82] and unemployment [24, 31, 32, 52, 66, 68]. Interestingly, one study [34] differentiated between temporary and long-term effects of GDP and unemployment, alluding to a possibly much greater complexity of the association with health. It is also worth noting that some gender differences were found, with determinants being more influential for males than for females, or only having statistically significant effects for male health [19, 21, 28, 34, 36, 37, 39, 64, 65, 69].

Discussion

The purpose of this scoping review was to examine recent quantitative work on the topic of multi-country analyses of determinants of population health in high-income countries.

Measuring population health via relatively simple mortality-based indicators still seems to be the state of the art. What is more, these indicators are routinely considered one at a time, instead of, for example, employing existing statistical procedures to devise a more general composite index of population health, or using some of the established indices, such as disability-adjusted life expectancy (DALE) or quality-adjusted life expectancy (QALE). Although strong arguments for their wider use were already voiced decades ago [84], such summary measures surface only rarely in this research field.

On a related note, the greater data availability and accessibility that we enjoy today do not automatically equate to data quality. Nonetheless, this is routinely assumed in aggregate-level studies, and we almost never encountered a discussion of the topic. The far-from-trivial issue of data missingness, too, goes largely underappreciated. With all the recent methodological advancements in this area [85–88], there is no excuse for ignorance; and still, too few of the reviewed studies tackled the matter in any adequate fashion.

Much optimism can be gained from the abundance of different determinants that have attracted researchers’ attention in relation to population health. We took a visual approach to these determinants, presenting a graph that links spatial distances between determinants to the frequency with which they were studied together. To facilitate interpretation, we grouped some variables, which resulted in some loss of finer detail. Nevertheless, the graph is helpful in exemplifying how many effects continue to be studied in a very limited context, if any. Since in reality no factor acts in isolation, this oversimplification practice threatens to render the whole exercise meaningless from the outset. The importance of multivariate analysis cannot be stressed enough. While there is no single “best method” to recommend, and appropriate techniques vary according to the specifics of the research question and the characteristics of the data at hand [89–93], in the future, in addition to abandoning simplistic univariate approaches, we hope to see a shift from the currently dominating fixed effects to the more flexible random/mixed effects models [94], as well as wider application of more sophisticated methods, such as principal component regression, partial least squares, covariance structure models (e.g., structural equations), canonical correlations, time-series analysis, and generalized estimating equations.
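To make the direction we advocate tangible, a minimal random-intercept sketch in statsmodels on synthetic data (hypothetical names throughout; a random slope could be added via re_formula="~gdp"):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: 10 countries x 12 years with country-specific intercepts.
rng = np.random.default_rng(2)
countries = [f"c{i}" for i in range(10)]
df = pd.DataFrame({"country": np.repeat(countries, 12)})
df["gdp"] = rng.normal(0, 1, len(df))
intercepts = dict(zip(countries, rng.normal(0, 1, len(countries))))
df["le"] = (80 + df["country"].map(intercepts)
            + 0.4 * df["gdp"] + rng.normal(0, 0.3, len(df)))

# Random intercept per country: a mixed-effects alternative to the
# fixed-effects specifications that currently dominate the field.
mixed = smf.mixedlm("le ~ gdp", data=df, groups="country").fit()
print(mixed.summary())
```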

Finally, there are some limitations of the current scoping review. We searched the two main databases for published research in medical and non-medical sciences (PubMed and Web of Science) since 2013, thus potentially excluding publications and reports that are not indexed in these databases, as well as older indexed publications. These choices were guided by our interest in the most recent (i.e., the current state-of-the-art) and arguably the highest-quality research (i.e., peer-reviewed articles, primarily in indexed non-predatory journals). Furthermore, despite holding a critical stance with regard to some aspects of how determinants-of-health research is currently conducted, we opted out of formally assessing the quality of the individual studies included. The reason for this is twofold. On the one hand, we are unaware of the existence of a formal and standard tool for quality assessment of ecological designs. On the other, we consider trying to score the quality of these diverse studies (in terms of regional setting, specific topic, outcome indices, and methodology) undesirable and misleading, particularly since we would sometimes have been rating the quality of only a (small) part of the original studies: the part that was relevant to our review’s goal.

Our aim was to investigate the current state of research on the very broad and general topic of population health, specifically, the way it has been examined in a multi-country context. We learned that data treatment and analytical approach were, in the majority of these recent studies, ill-equipped or insufficiently transparent to provide clarity regarding the underlying mechanisms of population health in high-income countries. Whether due to methodological shortcomings or the inherent complexity of the topic, research so far fails to provide any definitive answers. It is our sincere belief that with the application of more advanced analytical techniques this continuous quest could come to fruition sooner.

Supporting information

S1 Checklist. Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist.

https://doi.org/10.1371/journal.pone.0239031.s001

S1 Appendix.

https://doi.org/10.1371/journal.pone.0239031.s002

S2 Appendix.

https://doi.org/10.1371/journal.pone.0239031.s003

  • 75. Dahlgren G, Whitehead M. Policies and Strategies to Promote Equity in Health. Stockholm, Sweden: Institute for Future Studies; 1991.
  • 76. Brunner E, Marmot M. Social Organization, Stress, and Health. In: Marmot M, Wilkinson RG, editors. Social Determinants of Health. Oxford, England: Oxford University Press; 1999.
  • 77. Najman JM. A General Model of the Social Origins of Health and Well-being. In: Eckersley R, Dixon J, Douglas B, editors. The Social Origins of Health and Well-being. Cambridge, England: Cambridge University Press; 2001.
  • 85. Carpenter JR, Kenward MG. Multiple Imputation and its Application. New York: John Wiley & Sons; 2013.
  • 86. Molenberghs G, Fitzmaurice G, Kenward MG, Verbeke G, Tsiatis AA. Handbook of Missing Data Methodology. Boca Raton: Chapman & Hall/CRC; 2014.
  • 87. van Buuren S. Flexible Imputation of Missing Data. 2nd ed. Boca Raton: Chapman & Hall/CRC; 2018.
  • 88. Enders CK. Applied Missing Data Analysis. New York: Guilford; 2010.
  • 89. Searle SR, Casella G, McCulloch CE. Variance Components. New York: John Wiley & Sons; 1992.
  • 90. Agresti A. Foundations of Linear and Generalized Linear Models. Hoboken, New Jersey: John Wiley & Sons; 2015.
  • 91. Leyland AH, Goldstein H, editors. Multilevel Modelling of Health Statistics. Chichester, England: John Wiley & Sons; 2001.
  • 92. Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G, editors. Longitudinal Data Analysis. New York: Chapman and Hall/CRC; 2008.
  • 93. Härdle WK, Simar L. Applied Multivariate Statistical Analysis. Berlin, Heidelberg: Springer; 2015.
  • Published: 19 June 2020

Quantitative measures of health policy implementation determinants and outcomes: a systematic review

  • Peg Allen (ORCID: orcid.org/0000-0001-7000-796X) 1,
  • Meagan Pilar 1,
  • Callie Walsh-Bailey 1,
  • Cole Hooley 2,
  • Stephanie Mazzucca 1,
  • Cara C. Lewis 3,
  • Kayne D. Mettert 3,
  • Caitlin N. Dorsey 3,
  • Jonathan Purtle 4,
  • Maura M. Kepper 1,
  • Ana A. Baumann 5 &
  • Ross C. Brownson 1, 6

Implementation Science, volume 15, Article number: 47 (2020)


Abstract

Public policy has tremendous impacts on population health. While policy development has been extensively studied, policy implementation research is newer and relies largely on qualitative methods. Quantitative measures are needed to disentangle differential impacts of policy implementation determinants (i.e., barriers and facilitators) and outcomes to ensure intended benefits are realized. Implementation outcomes include acceptability, adoption, appropriateness, compliance/fidelity, feasibility, penetration, sustainability, and costs. This systematic review identified quantitative measures that are used to assess health policy implementation determinants and outcomes and evaluated the quality of these measures.

Three frameworks guided the review: Implementation Outcomes Framework (Proctor et al.), Consolidated Framework for Implementation Research (Damschroder et al.), and Policy Implementation Determinants Framework (Bullock et al.). Six databases were searched: Medline, CINAHL Plus, PsycInfo, PAIS, ERIC, and Worldwide Political. Searches were limited to English language, peer-reviewed journal articles published January 1995 to April 2019. Search terms addressed four levels: health, public policy, implementation, and measurement. Empirical studies of public policies addressing physical or behavioral health with quantitative self-report or archival measures of policy implementation with at least two items assessing implementation outcomes or determinants were included. Consensus scoring of the Psychometric and Pragmatic Evidence Rating Scale assessed the quality of measures.

Database searches yielded 8417 non-duplicate studies, with 870 (10.3%) undergoing full-text screening, yielding 66 studies. From the included studies, 70 unique measures were identified to quantitatively assess implementation outcomes and/or determinants. Acceptability, feasibility, appropriateness, and compliance were the most commonly measured implementation outcomes. Common determinants in the identified measures were organizational culture, implementation climate, and readiness for implementation, each aspects of the internal setting. Pragmatic quality ranged from adequate to good, with most measures freely available, brief, and at high school reading level. Few psychometric properties were reported.

Conclusions

Well-tested quantitative measures of implementation internal settings were under-utilized in policy studies. Further development and testing of external context measures are warranted. This review is intended to stimulate measure development and high-quality assessment of health policy implementation outcomes and determinants to help practitioners and researchers spread evidence-informed policies to improve population health.

Registration

Not registered


Contributions to the literature

This systematic review identified 70 quantitative measures of implementation outcomes or determinants in health policy studies.

Readiness to implement and organizational climate and culture were commonly assessed determinants, but fewer studies assessed policy actor relationships or implementation outcomes of acceptability, fidelity/compliance, appropriateness, feasibility, or implementation costs.

Study team members rated most identified measures’ pragmatic properties as good, meaning they are straightforward to use, but few studies documented pilot or psychometric testing of measures.

Further development and dissemination of valid and reliable measures of policy implementation outcomes and determinants can facilitate identification, use, and spread of effective policy implementation strategies.

Background

Despite major impacts of policy on population health [1, 2, 3, 4, 5, 6, 7], there have been relatively few policy studies in dissemination and implementation (D&I) science to inform implementation strategies and evaluate implementation efforts [8]. While health outcomes of policies are commonly studied, fewer policy studies assess implementation processes and outcomes. Of 146 D&I studies funded by the National Institutes of Health (NIH) through D&I funding announcements from 2007 to 2014, 12 (8.2%) were policy studies that assessed policy content, policy development processes, or health outcomes of policies, representing 10.5% of NIH D&I funding [8]. Eight of the 12 studies (66.7%) assessed health outcomes, while only five (41.6%) assessed implementation [8].

Our ability to explore the differential impact of policy implementation determinants and outcomes, and to disentangle these from health benefits and other societal outcomes, requires high-quality quantitative measures [9]. While systematic reviews of measures of implementation of evidence-based interventions (in clinical and community settings) have been conducted in recent years [10, 11, 12, 13], to our knowledge, no reviews have explored the quality of quantitative measures of determinants and outcomes of policy implementation.

Policy implementation research in political science and the social sciences has been active since at least the 1970s and has much to contribute to the newer field of D&I research [1, 14]. Historically, theoretical frameworks and policy research largely emphasized policy development or analysis of the content of policy documents themselves [15]. For example, Kingdon’s Multiple Streams Framework and its expansions have been widely used in political science and the social sciences more broadly to describe how factors related to sociopolitical climate, attributes of a proposed policy, and policy actors (e.g., organizations, sectors, individuals) contribute to policy change [16, 17, 18]. Policy frameworks can also inform implementation planning and evaluation in D&I research. Although authors have named policy stages since the 1950s [19, 20], Sabatier and Mazmanian’s Policy Implementation Process Framework was one of the first such frameworks that gained widespread use in policy implementation research [21] and later in health promotion [22]. Yet, available implementation frameworks are not often used to guide implementation strategies or inform why a policy worked in one setting but not another [23]. Without explicit focus on implementation, the intended benefits of health policies may go unrealized, and the ability may be lost to move the field forward to understand policy implementation (i.e., our collective knowledge building is dampened) [24].

Differences in perspectives and terminology between D&I and policy research in political science are noteworthy for interpreting the present review. For example, Proctor et al. use the term implementation outcomes for what policy researchers call policy outputs [14, 20, 25]. To non-D&I policy researchers, policy implementation outcomes refer to the health outcomes in the target population [20]. D&I science uses the term fidelity [26]; policy researchers write about compliance [20]. While D&I science uses the terms outer setting, outer context, or external context to point to influences outside the implementing organization [26, 27, 28], non-D&I policy research refers to policy fields [24], which are networks of agencies that carry out policies and programs.

Identification of valid and reliable quantitative measures of health policy implementation processes is needed. These measures are needed to advance from classifying constructs to understanding causality in policy implementation research [29]. Given limited resources, policy implementers also need to know which aspects of implementation are key to improving policy acceptance, compliance, and sustainability to reap the intended health benefits [30]. Both pragmatic and psychometrically sound measures are needed to accomplish these objectives [10, 11, 31, 32], so the field can explore the influence of nuanced determinants and generate reliable and valid findings.

To fill this void in the literature, this systematic review of health policy implementation measures aimed to (1) identify quantitative measures used to assess health policy implementation outcomes (IOF outcomes commonly called policy outputs in policy research) and inner and outer setting determinants, (2) describe and assess pragmatic quality of policy implementation measures, (3) describe and assess the quality of psychometric properties of identified instruments, and (4) elucidate health policy implementation measurement gaps.

Methods

The study team used systematic review procedures developed by Lewis and colleagues for reviews of D&I research measures and received detailed guidance from the Lewis team coauthors for each step [10, 11]. We followed the PRISMA reporting guidelines as shown in the checklist (Supplemental Table 1). We have also provided a publicly available website of measures identified in this review (https://www.health-policy-measures.org/).

For the purposes of this review, policy and policy implementation are defined as follows. We deemed public policy to include legislation at the federal, state/province/regional unit, or local levels, and governmental regulations, whether mandated by national, state/province, or local level governmental agencies or boards of elected officials (e.g., state boards of education in the USA) [4, 20]. Here, public policy implementation is defined as the carrying out of a governmental mandate by public or private organizations and groups of organizations [20].

Two widely used frameworks from the D&I field guide the present review, together with a third, recently developed framework that bridges policy and D&I research. In the Implementation Outcomes Framework (IOF), Proctor and colleagues identify and define eight implementation outcomes that are differentiated from health outcomes: acceptability, adoption, appropriateness, cost, feasibility, fidelity, penetration, and sustainability [25]. In the Consolidated Framework for Implementation Research (CFIR), Damschroder and colleagues articulate determinants of implementation, including the domains of intervention characteristics, outer setting, inner setting of an organization, characteristics of individuals within organizations, and process [33]. Finally, Bullock developed the Policy Implementation Determinants Framework to present a balanced framework that emphasizes both internal setting constructs and external setting constructs, including policy actor relationships and networks, political will for implementation, and visibility of policy actors [34]. The constructs identified in these frameworks were used to guide our list of implementation determinants and outcomes.

Through EBSCO, we searched MEDLINE, PsycInfo, and CINAHL Plus. Through ProQuest, we searched PAIS, Worldwide Political, and ERIC. Due to limited time and staff in the 12-month study, we did not search the grey literature. We used multiple search terms in each of four required levels: health, public policy, implementation, and measurement; Table 1 shows the search terms for each string. Supplemental Tables 2 and 3 show the final search syntax applied in EBSCO and ProQuest.

The authors developed the search strings and terms based on policy implementation framework reviews [34, 35], additional policy implementation frameworks [21, 22], labels and definitions of the eight implementation outcomes identified by Proctor et al. [25], CFIR construct labels and definitions [9, 33], and additional D&I research and search term sources [28, 36, 37, 38] (Table 1). The full study team provided three rounds of feedback on draft terms, and a library scientist provided additional synonyms and search terms. For each test search, we calculated the percentage of the 18 benchmark articles that the search captured. We determined a priori that an 80% capture rate was acceptable.
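A trivial sketch of the benchmark check described above (function and variable names are ours, not the authors'):

```python
def benchmark_capture(retrieved, benchmark):
    """Percentage of known-relevant benchmark articles a search returns."""
    hits = set(retrieved) & set(benchmark)
    return 100.0 * len(hits) / len(benchmark)

# e.g., a trial search retrieving 15 of the 18 benchmark articles scores
# ~83.3, which clears the 80% threshold set a priori.
```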

Inclusion and exclusion criteria

This review addressed only measures of implementation by organizations mandated to act by governmental units or legislation. Measures of behavior change among individuals in target populations as a result of legislation or governmental regulations, and measures of changes in health status, were outside the scope of this review.

There were several inclusion criteria: (1) empirical studies of the implementation of public policies already passed or approved that addressed physical or behavioral health, (2) quantitative self-report or archival measurement methods utilized, (3) published in peer-reviewed journals from January 1995 through April 2019, (4) published in the English language, (5) public policy implementation studies from any continent or international governing body, and (6) at least two transferable quantitative self-report or archival items that assessed implementation determinants [ 33 , 34 ] and/or IOF implementation outcomes [ 25 ]. This study sought to identify transferable measures that could be used to assess multiple policies and contexts. Here, a transferable item is defined as one that needed no wording changes or only a change in the referent (e.g., policy title or topic such as tobacco or malaria) to make the item applicable to other policies or settings [ 11 ]. The year 1995 was chosen as a starting year because that is about when web-based quantitative surveying began [ 39 ]. Table 2 provides definitions of the IOF implementation outcomes and the selected determinants of implementation. Broader constructs, such as readiness for implementation, contained multiple categories.
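
To make the transferable-item rule concrete, the toy sketch below swaps only the referent in an item; the item wording and policy names are invented for illustration.

```python
# Illustration of the "transferable item" definition above: an item counts
# as transferable when only its referent (the policy title or topic) must
# change for reuse. The template and policy names here are hypothetical.

TRANSFERABLE_ITEM = "Our organization has adequate staff to implement {policy}."

def adapt_item(template, policy):
    """Reuse a transferable item by changing only the referent."""
    return template.format(policy=policy)

print(adapt_item(TRANSFERABLE_ITEM, "the smoke-free workplace law"))
print(adapt_item(TRANSFERABLE_ITEM, "the school nutrition mandate"))
```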

Exclusion criteria were (1) non-empirical health policy journal articles (e.g., conceptual articles, editorials); (2) narrative and systematic reviews; (3) studies with only qualitative assessment of health policy implementation; (4) empirical studies reported in theses and books; (5) health policy studies that only assessed health outcomes (i.e., target population changes in health behavior or status); (6) bill analyses, stakeholder perceptions assessed to inform policy development, and policy content analyses without implementation assessment; (7) studies of changes made in a private business not encouraged by public policy; and (8) studies from countries with authoritarian regimes. We electronically programmed the searches to exclude policy implementation studies from countries that are not democratically governed, due to vast differences in policy environments and implementation factors.

Screening procedures

Citations were downloaded into EndNote version 7.8 and de-duplicated electronically. We conducted dual independent screening of titles and abstracts after two group pilot screening sessions in which we clarified inclusion and exclusion criteria and screening procedures. Abstract screeners used Covidence systematic review software [ 40 ] to code inclusion as yes or no, and articles advanced to full-text review if at least one screener coded them as meeting the inclusion criteria. Full-text screening was likewise conducted via dual independent coding in Covidence [ 40 ], with weekly meetings to reach consensus on inclusion/exclusion discrepancies. Screeners also coded one of the pre-identified reasons for exclusion.
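
As a minimal sketch, assuming the decision rule stated above, the abstract-screening step reduces to an inclusive OR over the two screeners' votes.

```python
# Toy encoding of the abstract-screening rule: an article advances to
# full-text review if at least one of the two independent screeners
# votes to include it. Purely illustrative.

def advance_to_full_text(screener_votes):
    """Inclusive rule: any single 'include' vote advances the article."""
    return any(screener_votes)

print(advance_to_full_text([True, False]))   # True  -> full-text review
print(advance_to_full_text([False, False]))  # False -> excluded at abstract stage
```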

Data extraction strategy

Extraction elements included information about (1) measure meta-data (e.g., measure name, total number of items, number of transferable items) and studies (e.g., policy topic, country, setting), (2) development and testing of the measure, (3) implementation outcomes and determinants assessed (Table 2 ), (4) pragmatic characteristics, and (5) psychometric properties. Where needed, authors were emailed to obtain the full measure and measure development information. Two coauthors (MP, CWB) reached consensus on extraction elements. For each included measure, a primary extractor conducted initial entries and coding. Due to time and staff limitations in the 12-month study, we did not search for each empirical use of the measure. A secondary extractor checked the entries, noting any discrepancies for discussion in consensus meetings. Multiple measures in a study were extracted separately.

Quality assessment of measures

To assess the quality of measures, we applied the Psychometric and Pragmatic Evidence Rating Scales (PAPERS) developed by Lewis et al. [ 10 , 11 , 41 , 42 ]. PAPERS includes assessment of five pragmatic instrument characteristics that affect how easy or difficult an instrument is to use: brevity (number of items), simplicity of language (readability level), cost (whether it is freely available), training burden (extent of data collection training needed), and analysis burden (ease or difficulty of scoring and interpretation of results). Lewis and colleagues developed the pragmatic domains and rating scales with stakeholder and D&I researcher input [ 11 , 41 , 42 ] and developed the psychometric rating scales in collaboration with D&I researchers [ 10 , 11 , 43 ]. The psychometric rating scale covers nine properties (Table 3 ): internal consistency; norms; responsiveness; convergent, discriminant, and known-groups construct validity; predictive and concurrent criterion validity; and structural validity. In both the pragmatic and psychometric scales, reported evidence for each domain is scored as poor (− 1), none/not reported (0), minimal/emerging (1), adequate (2), good (3), or excellent (4). Higher values indicate more desirable pragmatic characteristics (e.g., fewer items, freely available, scoring instructions and interpretations provided) and stronger evidence of psychometric properties (e.g., adequate to excellent reliability and validity) (Supplemental Tables 4 and 5 ).
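
The ordinal scoring just described lends itself to a simple tally. The sketch below is not the authors' scoring code; the domain names follow the text, and the example ratings are invented for illustration.

```python
# Hedged sketch of PAPERS-style scoring: each domain gets an ordinal
# rating from -1 (poor) to 4 (excellent), and domain ratings sum to a
# total (-5 to 20 for the five pragmatic domains; -9 to 36 for the nine
# psychometric domains). Example ratings below are invented.

PAPERS_LEVELS = {"poor": -1, "none": 0, "minimal": 1,
                 "adequate": 2, "good": 3, "excellent": 4}

PRAGMATIC_DOMAINS = ["brevity", "simplicity", "cost",
                     "training_burden", "analysis_burden"]

def total_score(ratings, domains):
    """Sum ordinal domain ratings; unreported domains score 0 ('none')."""
    return sum(PAPERS_LEVELS[ratings.get(d, "none")] for d in domains)

example = {"brevity": "good", "simplicity": "good", "cost": "excellent",
           "analysis_burden": "minimal"}  # training burden not reported
print(total_score(example, PRAGMATIC_DOMAINS))  # 3 + 3 + 4 + 0 + 1 = 11
```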

Data synthesis and presentation

This section describes the synthesis of measure transferability, empiric-use study settings and policy topics, and PAPERS scoring. Two coauthors (MP, CWB) consensus coded measures into three categories of item transferability based on quartile cut points: mostly transferable (≥ 75% of items deemed transferable), partially transferable (25–74% of items deemed transferable), and setting-specific (< 25% of items deemed transferable). Items were deemed transferable if no wording changes, or only a change in the referent (e.g., policy title or topic), were needed to make the item applicable to the implementation of other policies or in other settings. Abstractors coded study settings into one of five categories: hospital or outpatient clinics; mental or behavioral health facilities; schools; community; and multiple. Abstractors also coded policy topics as healthcare cost, access, or quality; mental or behavioral health; infectious or chronic diseases; and other, while retaining documentation of subtopics such as tobacco, physical activity, and nutrition. Pragmatic scores were totaled for the five properties, yielding possible total scores of − 5 to 20; higher values indicate an easier-to-use instrument. Psychometric total scores for the nine properties were also calculated, with possible scores of − 9 to 36; higher values indicate stronger evidence across multiple types of reliability and validity.
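
For illustration, the transferability cut points above map directly onto a small helper; the function name and the example item counts are hypothetical.

```python
# Illustrative coding of the transferability categories defined above:
# mostly transferable (>= 75% of items), partially transferable (25-74%),
# and setting-specific (< 25%). The measure data are made up.

def transferability_category(n_transferable, n_items):
    """Assign a measure to a transferability category by item percentage."""
    pct = 100 * n_transferable / n_items
    if pct >= 75:
        return "mostly transferable"
    if pct >= 25:
        return "partially transferable"
    return "setting-specific"

print(transferability_category(12, 14))  # ~86% -> mostly transferable
print(transferability_category(5, 14))   # ~36% -> partially transferable
print(transferability_category(2, 14))   # ~14% -> setting-specific
```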

The database searches yielded 11,684 articles, of which 3267 were duplicates (Fig. 1 ). Titles and abstracts of the remaining 8417 articles were independently screened by two team members; 870 (10.3%) were selected for full-text screening by at least one screener. Of these 870 studies, 804 were excluded at full-text screening or during extraction attempts with the consensus of two coauthors, leaving 66 included studies. Two coauthors (MP, CWB) reached consensus on extraction and coding of information on the 70 unique quantitative eligible measures identified in the 66 included studies, plus measure development articles where these could be obtained. Nine measures were used in more than one included study. Detailed information on identified measures is publicly available at https://www.health-policy-measures.org/ .

Fig. 1 PRISMA flow diagram

The most common exclusion reason was lack of transferable items in quantitative measures of policy implementation ( n = 597) (Fig. 1 ). While this review focused on transferable measures across any health issue or setting, researchers addressing specific health policies or settings may find the excluded studies of interest. The frequencies of the remaining exclusion reasons are listed in Fig. 1 .

A variety of health policy topics and settings from over two dozen countries were found in the database searches. For example, the searches identified quantitative and mixed methods implementation studies of legislation (such as tobacco smoking bans), regulations (such as food/menu labeling requirements), governmental policies that mandated specific clinical practices (such as vaccination or access to HIV antiretroviral treatment), school-based interventions (such as government-mandated nutritional content and physical activity), and other public policies.

Among the 70 unique quantitative implementation measures, 15 measures were deemed mostly transferable (at least 75% transferable, Table 4 ). Twenty-three measures were categorized as partially transferable (25 to 74% of items deemed transferable, Table 5 ); 32 measures were setting-specific (< 25% of items deemed transferable, data not shown).

Implementation outcomes

Among the 70 measures, the most commonly assessed implementation outcomes were fidelity/compliance of the policy implementation to the government mandate (26%), acceptability of the policy to implementers (24%), perceived appropriateness of the policy (17%), and feasibility of implementation (17%) (Table 2 ). Fidelity/compliance was sometimes assessed by asking implementers the extent to which they had modified a mandated practice [ 45 ]; in other cases, detailed checklists were used to assess the extent of compliance with the many mandated policy components, such as school nutrition policies [ 83 ]. Acceptability was assessed by asking staff or healthcare providers in implementing agencies their level of agreement with statements about the policy mandate, scored on Likert scales. Only eight (11%) of the included measures used multiple transferable items to assess adoption, and only eight (11%) assessed penetration.

Twenty-six measures of implementation costs were found during full-text screening (10 in included studies and 16 in excluded studies, data not shown). The cost time horizon varied from 12 months to 21 years, with most cost measures assessed at multiple time points. Ten of the 26 measures addressed direct implementation costs, and nine studies reported cost modeling findings. The implementation cost survey developed by Vogler et al. was extensive [ 53 ]; it asked implementing organizations to note policy impacts on medication pricing, margins, reimbursement rates, and insurance co-pays.

Determinants of implementation

Within the 70 included measures, the most commonly assessed implementation determinants were readiness for implementation (61% assessed at least one readiness component) and general organizational culture and climate (39%), followed by the specific policy implementation climate within the implementing organization(s) (23%), actor relationships and networks (17%), political will for policy implementation (11%), and visibility of the policy role and policy actors (10%) (Table 2 ). Each component of readiness for implementation was commonly assessed: communication of the policy (31%, 22 of 70 measures), policy awareness and knowledge (26%), resources for policy implementation (non-training resources 27%, training 20%), and leadership commitment to implement the policy (19%).

Only two studies assessed organizational structure as a determinant of health policy implementation. Lavinghouze and colleagues assessed the stability of the organization, defined as whether reorganization happens often, within a set of 9-point Likert items on multiple implementation determinants designed for use with state-level public health practitioners; they also assessed whether public health departments were stand-alone agencies or embedded within agencies addressing additional services, such as social services [ 69 ]. Schneider and colleagues assessed coalition structure as an implementation determinant, including items on the number of organizations and individuals on the coalition roster, the number that regularly attend coalition meetings, and so forth [ 72 ].

Tables of measures

Tables 4 and 5 present the 38 measures of implementation outcomes and/or determinants identified out of the 70 included measures with at least 25% of items transferable (useable in other studies without wording changes or by changing only the policy name or other referent). Table 4 shows 15 mostly transferable measures (at least 75% transferable). Table 5 shows 23 partially transferable measures (25–74% of items deemed transferable). Separate measure development articles were found for 20 of the 38 measures; the remaining measures seemed to be developed for one-time, study-specific use by the empirical study authors cited in the tables. Studies listed in Tables 4 and 5 were conducted most commonly in the USA ( n = 19) or Europe ( n = 11). A few measures were used elsewhere: Africa ( n = 3), Australia ( n = 1), Canada ( n = 1), Middle East ( n = 1), Southeast Asia ( n = 1), or across multiple continents ( n = 1).

Quality of identified measures

Figure 2 shows the median pragmatic quality ratings across the 38 measures with at least 25% transferable items shown in Tables 4 and 5 . Higher scores are desirable and indicate the measures are easier to use (Table 3 ). Overall, the measures were freely available in the public domain (median score = 4), brief, with a median of 11–50 items (median score = 3), and readable, with a median reading level between 8th and 12th grade (median score = 3). However, instructions on how to score and interpret item scores were lacking (median score = 1): the measures generally did not include suggestions for interpreting score ranges, clear cutoff scores, or instructions for handling missing data. Information on training requirements, or on the availability of self-training manuals for using the measures, was generally not reported in the included study or measure development article(s) (median score = 0, not reported). Total pragmatic rating scores among the 38 measures with at least 25% of items transferable ranged from 7 to 17 (Tables 4 and 5 ), with a median total score of 12 out of a possible 20. Median scores for each pragmatic characteristic across all 70 measures matched those of the 38 mostly or partially transferable measures, with a median total score of 11 across all measures.

Fig. 2 Pragmatic rating scale results across identified measures. Footnote: pragmatic criteria scores are from the Psychometric and Pragmatic Evidence Rating Scale (PAPERS) (Lewis et al. [ 11 ], Stanick et al. [ 42 ]). Total possible score = 20; total median score across 38 measures = 11; scores ranged from 0 to 18. Rating scales for each domain are provided in Supplemental Table 4

Few psychometric properties were reported, and the study team found few reports of pilot testing and measure refinement. Among the 38 measures with at least 25% transferable items, PAPERS psychometric total scores ranged from − 1 to 17 (Tables 4 and 5 ), with a median total score of 5 out of a possible 36; higher scores indicate that more types of reliability and validity were reported with high quality. The 32 measures with calculable norms had a median norms PAPERS score of 3 (good), indicating appropriate sample size and distribution. The nine measures with reported internal consistency mostly showed Cronbach’s alphas in the adequate (0.70 to 0.79) to excellent (≥ 0.90) range, with a median alpha of 0.78 (PAPERS score of 2, adequate). The five measures with reported structural validity had a median PAPERS score of 2, adequate (range 1 to 3, minimal to good), indicating sufficient sample size and reasonable factor analysis goodness of fit. Among the 38 measures, no reports were found for responsiveness, convergent validity, discriminant validity, known-groups construct validity, or predictive or concurrent criterion validity.
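
As context for the internal-consistency results above, a brief sketch (not the authors' code; data simulated) shows Cronbach's alpha computed from an item-response matrix and mapped to quality bands. The 0.80–0.89 "good" band is our assumed interpolation between the adequate and excellent bands named in the text.

```python
# Minimal sketch: Cronbach's alpha from a respondents-by-items matrix,
# then a rough mapping to quality bands. The "good" band (0.80-0.89) is
# an assumed interpolation between the adequate (0.70-0.79) and
# excellent (>= 0.90) bands mentioned in the text. Data are simulated.

import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

def quality_band(alpha):
    if alpha >= 0.90:
        return "excellent"
    if alpha >= 0.80:
        return "good (assumed band)"
    if alpha >= 0.70:
        return "adequate"
    return "below adequate"

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))                         # shared trait
responses = latent + rng.normal(scale=0.8, size=(200, 6))  # 6 noisy items
a = cronbach_alpha(responses)
print(round(a, 2), quality_band(a))  # expect roughly 0.90 for this simulation
```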

In this systematic review, we sought to identify quantitative measures used to assess health policy implementation outcomes and determinants, rate the pragmatic and psychometric quality of identified measures, and point to future directions to address measurement gaps. In general, the identified measures are easy to use and freely available, but we found little data on validity and reliability. We found more quantitative measures of intra-organizational determinants of policy implementation than measures of the relationships and interactions between organizations that influence policy implementation. We found a limited number of measures that had been developed for or used to assess one of the eight IOF policy implementation outcomes that can be applied to other policies or settings, which may speak more to differences in terms used by policy researchers and D&I researchers than to differences in conceptualizations of policy implementation. Authors used a variety of terms and rarely provided definitions of the constructs the items assessed. Input from experts in policy implementation is needed to better understand and define policy implementation constructs for use across multiple fields involved in policy-related research.

We found that several researchers had used well-tested measures of implementation determinants from D&I research or from the organizational behavior and management literature (Tables 4 and 5 ). For the internal setting of implementing organizations, whether changes are mandated through public policy or not, well-developed and tested measures are available. However, a number of authors crafted their own items, with or without pilot testing, and used a variety of terms to describe what the items assessed. Wider dissemination of information about available, well-tested measures to policy researchers is warranted [ 9 , 13 ].

What appears to be a larger gap involves the availability of well-developed and tested quantitative measures of the external context affecting policy implementation that can be used across multiple policy settings and topics [ 9 ]. Lack of attention to how a policy initiative fits with the external implementation context during policymaking and lack of policymaker commitment of adequate resources for implementation contribute to this gap [ 23 , 93 ]. Recent calls and initiatives to integrate health policies during policymaking and implementation planning will bring more attention to external contexts affecting not only policy development but implementation as well [ 93 , 94 , 95 , 96 , 97 , 98 , 99 ]. At the present time, it is not well-known which internal and external determinants are most essential to guide and achieve sustainable policy implementation [ 100 ]. Identification and dissemination of measures that assess factors that facilitate the spread of evidence-informed policy implementation (e.g., relative advantage, flexibility) will also help move policy implementation research forward [ 1 , 9 ].

Given the high potential population health impact of evidence-informed policies, much more attention to policy implementation is needed in D&I research. Few studies from non-D&I researchers reported policy implementation measure development procedures, pilot testing, scoring procedures and interpretation, training of data collectors, or data analysis procedures. Policy implementation research could benefit from the rigor of D&I quantitative research methods. And D&I researchers have much to learn about the contexts and practical aspects of policy implementation and can look to the rich depth of information in qualitative and mixed methods studies from other fields to inform quantitative measure development and testing [ 101 , 102 , 103 ].

Limitations

This systematic review has several limitations. First, the four levels of the search string and multiple search terms in each level were applied only to the title, abstract, and subject headings, due to limitations of the search engines, so we likely missed pertinent studies. Second, a systematic approach with stakeholder input is needed to expand the definitions of IOF implementation outcomes for policy implementation. Third, although the authors value intra-organizational policymaking and implementation, the study team restricted the search to governmental policies due to limited time and staffing in the 12-month study. Fourth, by excluding tools with only policy-specific implementation measures, we excluded some well-developed and tested instruments in abstract and full-text screening. Since only 12 measures had 100% transferable items, researchers may need to pilot test wording modifications of other items. And finally, due to limited time and staffing, we only searched online for measures and measures development articles and may have missed separately developed pragmatic information, such as training and scoring materials not reported in a manuscript.

Despite the limitations, several recommendations for measure development follow from the findings and related literature [ 1 , 11 , 20 , 35 , 41 , 104 ], including the need to (1) conduct systematic, mixed-methods procedures (concept mapping, expert panels) to refine policy implementation outcomes, (2) expand and more fully specify external context domains for policy implementation research and evaluation, (3) identify and disseminate well-developed measures for specific policy topics and settings, (4) ensure that policy implementation improves equity rather than exacerbating disparities [ 105 ], and (5) develop evidence-informed policy implementation guidelines.

Easy-to-use, reliable, and valid quantitative measures of policy implementation can further our understanding of policy implementation processes, determinants, and outcomes. Given the wide array of health policy topics and implementation settings, sound quantitative measures that can be applied across topics and settings will help accelerate learning across individual studies and aid the transfer from research to practice. Quantitative measures can inform the implementation of evidence-informed policies and further their spread and effective implementation, ultimately reaping greater population health benefit. This systematic review of measures is intended to stimulate measure development and high-quality assessment of health policy implementation outcomes and predictors, to help practitioners and researchers spread evidence-informed policies to improve population health and reduce inequities.

Availability of data and materials

A compendium of identified measures is available for dissemination at https://www.health-policy-measures.org/ . A link will be provided on the website of the Prevention Research Center, Brown School, Washington University in St. Louis, at https://prcstl.wustl.edu/ . The authors invite interested organizations to provide a link to the compendium. Citations and abstracts of excluded policy-specific measures are available on request.

Abbreviations

CFIR: Consolidated Framework for Implementation Research

CINAHL: Cumulative Index to Nursing and Allied Health Literature

D&I: Dissemination and implementation science

EBSCO: Elton B. Stephens Company

ERIC: Education Resources Information Center

IOF: Implementation Outcomes Framework

PAPERS: Psychometric and Pragmatic Evidence Rating Scale

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

Purtle J, Dodson EA, Brownson RC. Policy dissemination research. In: Brownson RC, Colditz GA, Proctor EK, editors. Dissemination and Implementation Research in Health: Translating Science to Practice, Second Edition. New York: Oxford University Press; 2018.

Brownson RC, Baker EA, Deshpande AD, Gillespie KN. Evidence-based public health. Third ed. New York, NY: Oxford University Press; 2018.

Guide to Community Preventive Services. About the Community Guide. Community Preventive Services Task Force; 2020 [updated October 03, 2019; cited 2020]. Available from: https://www.thecommunityguide.org/ .

Eyler AA, Chriqui JF, Moreland-Russell S, Brownson RC, editors. Prevention, policy, and public health, first edition. New York, NY: Oxford University Press; 2016.

Andre FE, Booy R, Bock HL, Clemens J, Datta SK, John TJ, et al. Vaccination greatly reduces disease, disability, death, and inequity worldwide. Geneva, Switzerland: World Health Organization; 2008 February 2008. Contract No.: 07-040089.

Cheng JJ, Schuster-Wallace CJ, Watt S, Newbold BK, Mente A. An ecological quantification of the relationships between water, sanitation and infant, child, and maternal mortality. Environ Health. 2012;11:4.

Levy DT, Li Y, Yuan Z. Impact of nations meeting the MPOWER targets between 2014 and 2016: an update. Tob Control. 2019.

Purtle J, Peters R, Brownson RC. A review of policy dissemination and implementation research funded by the National Institutes of Health, 2007-2014. Implement Sci. 2016;11:1.

Lewis CC, Proctor EK, Brownson RC. Measurement issues in dissemination and implementation research. In: Brownson RC, Ga C, Proctor EK, editors. Disssemination and Implementation Research in Health: Translating Science to Practice, Second Edition. New York: Oxford University Press; 2018.

Lewis CC, Fischer S, Weiner BJ, Stanick C, Kim M, Martinez RG. Outcomes for implementation science: an enhanced systematic review of instruments using evidence-based rating criteria. Implement Sci. 2015;10:155.

Lewis CC, Mettert KD, Dorsey CN, Martinez RG, Weiner BJ, Nolen E, et al. An updated protocol for a systematic review of implementation-related measures. Syst Rev. 2018;7(1):66.

Chaudoir SR, Dugan AG, Barr CH. Measuring factors affecting implementation of health innovations: a systematic review of structural, organizational, provider, patient, and innovation level measures. Implement Sci. 2013;8:22.

Rabin BA, Lewis CC, Norton WE, Neta G, Chambers D, Tobin JN, et al. Measurement resources for dissemination and implementation research in health. Implement Sci. 2016;11:42.

Nilsen P, Stahl C, Roback K, Cairney P. Never the twain shall meet?--a comparison of implementation science and policy implementation research. Implement Sci. 2013;8:63.

Sabatier PA, editor. Theories of the Policy Process. New York, NY: Routledge; 2019.

Kingdon J. Agendas, alternatives, and public policies, second edition. Second ed. New York: Longman; 1995.

Jones MD, Peterson HL, Pierce JJ, Herweg N, Bernal A, Lamberta Raney H, et al. A river runs through it: a multiple streams meta-review. Policy Stud J. 2016;44(1):13–36.

Fowler L. Using the multiple streams framework to connect policy adoption to implementation. Policy Studies Journal. 2020 (11 Feb).

Howlett M, Mukherjee I, Woo JJ. From tools to toolkits in policy design studies: the new design orientation towards policy formulation research. Policy Polit. 2015;43(2):291–311.

Natesan SD, Marathe RR. Literature review of public policy implementation. Int J Public Policy. 2015;11(4):219–38.

Sabatier PA, Mazmanian D. Implementation of public policy: a framework of analysis. Policy Stud J. 1980;8(4):538–60.

Sabatier PA. Theories of the Policy Process. Westview; 2007.

Tomm-Bonde L, Schreiber RS, Allan DE, MacDonald M, Pauly B, Hancock T, et al. Fading vision: knowledge translation in the implementation of a public health policy intervention. Implement Sci. 2013;8:59.

Roll S, Moulton S, Sandfort J. A comparative analysis of two streams of implementation research. Journal of Public and Nonprofit Affairs. 2017;3(1):3–22.

Proctor E, Silmere H, Raghavan R, Hovmand P, Aarons G, Bunger A, et al. Outcomes for implementation research: conceptual distinctions, measurement challenges, and research agenda. Admin Pol Ment Health. 2011;38(2):65–76.

Brownson RC, Colditz GA, Proctor EK, editors. Dissemination and implementation research in health: translating science to practice, second edition. New York: Oxford University Press; 2018.

Tabak RG, Khoong EC, Chambers DA, Brownson RC. Bridging research and practice: models for dissemination and implementation research. Am J Prev Med. 2012;43(3):337–50.

Rabin BA, Brownson RC, Haire-Joshu D, Kreuter MW, Weaver NL. A glossary for dissemination and implementation research in health. J Public Health Manag Pract. 2008;14(2):117–23.

Lewis CC, Klasnja P, Powell BJ, Lyon AR, Tuzzio L, Jones S, et al. From classification to causality: advancing understanding of mechanisms of change in implementation science. Front Public Health. 2018;6:136.

Boyd MR, Powell BJ, Endicott D, Lewis CC. A method for tracking implementation strategies: an exemplar implementing measurement-based care in community behavioral health clinics. Behav Ther. 2018;49(4):525–37.

Glasgow RE. What does it mean to be pragmatic? Pragmatic methods, measures, and models to facilitate research translation. Health Educ Behav. 2013;40(3):257–65.

Glasgow RE, Riley WT. Pragmatic measures: what they are and why we need them. Am J Prev Med. 2013;45(2):237–43.

Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implement Sci. 2009;4:50.

Bullock HL. Understanding the implementation of evidence-informed policies and practices from a policy perspective: a critical interpretive synthesis in: How do systems achieve their goals? the role of implementation in mental health systems improvement [Dissertation]. Hamilton, Ontario: McMaster University; 2019.

Watson DP, Adams EL, Shue S, Coates H, McGuire A, Chesher J, et al. Defining the external implementation context: an integrative systematic literature review. BMC Health Serv Res. 2018;18(1):209.

McKibbon KA, Lokker C, Wilczynski NL, Ciliska D, Dobbins M, Davis DA, et al. A cross-sectional study of the number and frequency of terms used to refer to knowledge translation in a body of health literature in 2006: a Tower of Babel? Implement Sci. 2010;5:16.

Terwee CB, Jansma EP, Riphagen II, de Vet HC. Development of a methodological PubMed search filter for finding studies on measurement properties of measurement instruments. Qual Life Res. 2009;18(8):1115–23.

Egan M, Maclean A, Sweeting H, Hunt K. Comparing the effectiveness of using generic and specific search terms in electronic databases to identify health outcomes for a systematic review: a prospective comparative study of literature search method. BMJ Open. 2012;2:3.

Dillman DA, Smyth JD, Christian LM. Internet, mail, and mixed-mode surveys: the tailored design method. Hoboken, NJ: John Wiley & Sons; 2009.

Covidence systematic review software. Melbourne, Australia: Veritas Health Innovation. https://www.covidence.org . Accessed Mar 2019.

Powell BJ, Stanick CF, Halko HM, Dorsey CN, Weiner BJ, Barwick MA, et al. Toward criteria for pragmatic measurement in implementation research and practice: a stakeholder-driven approach using concept mapping. Implement Sci. 2017;12(1):118.

Stanick CF, Halko HM, Nolen EA, Powell BJ, Dorsey CN, Mettert KD, et al. Pragmatic measures for implementation research: development of the Psychometric and Pragmatic Evidence Rating Scale (PAPERS). Transl Behav Med. 2019.

Henrikson NB, Blasi PR, Dorsey CN, Mettert KD, Nguyen MB, Walsh-Bailey C, et al. Psychometric and pragmatic properties of social risk screening tools: a systematic review. Am J Prev Med. 2019;57(6S1):S13–24.

Stirman SW, Miller CJ, Toder K, Calloway A. Development of a framework and coding system for modifications and adaptations of evidence-based interventions. Implement Sci. 2013;8:65.

Lau AS, Brookman-Frazee L. The 4KEEPS study: identifying predictors of sustainment of multiple practices fiscally mandated in children’s mental health services. Implement Sci. 2016;11:1–8.

Ekvall G. Organizational climate for creativity and innovation. European J Work Organizational Psychology. 1996;5(1):105–23.

Lövgren G, Eriksson S, Sandman PO. Effects of an implemented care policy on patient and personnel experiences of care. Scand J Caring Sci. 2002;16(1):3–11.

Dwyer DJ, Ganster DC. The effects of job demands and control on employee attendance and satisfaction. J Organ Behav. 1991;12:595–608.

Condon-Paoloni D, Yeatman HR, Grigonis-Deane E. Health-related claims on food labels in Australia: understanding environmental health officers’ roles and implications for policy. Public Health Nutr. 2015;18(1):81–8.

Patterson MG, West MA, Shackleton VJ, Dawson JF, Lawthom R, Maitlis S, et al. Validating the organizational climate measure: links to managerial practices, productivity and innovation. J Organ Behav. 2005;26:279–408.

Glisson C, Green P, Williams NJ. Assessing the Organizational Social Context (OSC) of child welfare systems: implications for research and practice. Child Abuse Negl. 2012;36(9):621–32.

Beidas RS, Aarons G, Barg F, Evans A, Hadley T, Hoagwood K, et al. Policy to implementation: evidence-based practice in community mental health--study protocol. Implement Sci. 2013;8(1):38.

Eisenberger R, Cummings J, Armeli S, Lynch P. Perceived organizational support, discretionary treatment, and job satisfaction. J Appl Psychol. 1997;82:812–20.

Eby L, George K, Brown BL. Going tobacco-free: predictors of clinician reactions and outcomes of the NY state office of alcoholism and substance abuse services tobacco-free regulation. J Subst Abus Treat. 2013;44(3):280–7.

Vogler S, Zimmermann N, de Joncheere K. Policy interventions related to medicines: survey of measures taken in European countries during 2010-2015. Health Policy. 2016;120(12):1363–77.

Wanberg CRB, Banas JT. Predictors and outcomes of openness to change in a reorganizing workplace. J Applied Psychology. 2000;85:132–42.

Hardy LJ, Wertheim P, Bohan K, Quezada JC, Henley E. A model for evaluating the activities of a coalition-based policy action group: the case of Hermosa Vida. Health Promot Pract. 2013;14(4):514–23.

Gavriilidis G, Östergren P-O. Evaluating a traditional medicine policy in South Africa: phase 1 development of a policy assessment tool. Glob Health Action. 2012;5:17271.

Hongoro C, Rutebemberwa E, Twalo T, Mwendera C, Douglas M, Mukuru M, et al. Analysis of selected policies towards universal health coverage in Uganda: the policy implementation barometer protocol. Archives Public Health. 2018;76:12.

Roeseler A, Solomon M, Beatty C, Sipler AM. The tobacco control network’s policy readiness and stage of change assessment: what the results suggest for moving tobacco control efforts forward at the state and territorial levels. J Public Health Manag Pract. 2016;22(1):9–19.

Brämberg EB, Klinga C, Jensen I, Busch H, Bergström G, Brommels M, et al. Implementation of evidence-based rehabilitation for non-specific back pain and common mental health problems: a process evaluation of a nationwide initiative. BMC Health Serv Res. 2015;15(1):79.

Rütten A, Lüschen G, von Lengerke T, Abel T, Kannas L, Rodríguez Diaz JA, et al. Determinants of health policy impact: comparative results of a European policymaker study. Soz Praventivmed. 2003;48(6):379–91.

Smith SN, Lai Z, Almirall D, Goodrich DE, Abraham KM, Nord KM, et al. Implementing effective policy in a national mental health reengagement program for veterans. J Nerv Ment Dis. 2017;205(2):161–70.

Carasso BS, Lagarde M, Cheelo C, Chansa C, Palmer N. Health worker perspectives on user fee removal in Zambia. Hum Resour Health. 2012;10:40.

Goldsmith RE, Hofacker CF. Measuring consumer innovativeness. J Acad Mark Sci. 1991;19(3):209–21.

Webster CA, Caputi P, Perreault M, Doan R, Doutis P, Weaver RG. Elementary classroom teachers’ adoption of physical activity promotion in the context of a statewide policy: an innovation diffusion and socio-ecologic perspective. J Teach Phys Educ. 2013;32(4):419–40.

Aarons GA, Glisson C, Hoagwood K, Kelleher K, Landsverk J, Cafri G. Psychometric properties and U.S. National norms of the Evidence-Based Practice Attitude Scale (EBPAS). Psychol Assess. 2010;22(2):356–65.

Gill KJ, Campbell E, Gauthier G, Xenocostas S, Charney D, Macaulay AC. From policy to practice: implementing frontline community health services for substance dependence--study protocol. Implement Sci. 2014;9:108.

Lavinghouze SR, Price AW, Parsons B. The environmental assessment instrument: harnessing the environment for programmatic success. Health Promot Pract. 2009;10(2):176–85.

Bull FC, Milton K, Kahlmeier S. National policy on physical activity: the development of a policy audit tool. J Phys Act Health. 2014;11(2):233–40.

Bull F, Milton K, Kahlmeier S, Arlotti A, Juričan AB, Belander O, et al. Turning the tide: national policy approaches to increasing physical activity in seven European countries. British J Sports Med. 2015;49(11):749–56.

Schneider EC, Smith ML, Ory MG, Altpeter M, Beattie BL, Scheirer MA, et al. State fall prevention coalitions as systems change agents: an emphasis on policy. Health Promot Pract. 2016;17(2):244–53.

Helfrich CD, Savitz LA, Swiger KD, Weiner BJ. Adoption and implementation of mandated diabetes registries by community health centers. Am J Prev Med. 2007;33(1,Suppl):S50-S65.

Donchin M, Shemesh AA, Horowitz P, Daoud N. Implementation of the Healthy Cities’ principles and strategies: an evaluation of the Israel Healthy Cities network. Health Promot Int. 2006;21(4):266–73.

Were MC, Emenyonu N, Achieng M, Shen C, Ssali J, Masaba JP, et al. Evaluating a scalable model for implementing electronic health records in resource-limited settings. J Am Med Inform Assoc. 2010;17(3):237–44.

Konduri N, Sawyer K, Nizova N. User experience analysis of e-TB Manager, a nationwide electronic tuberculosis recording and reporting system in Ukraine. ERJ Open Research. 2017;3:2.

McDonnell E, Probart C. School wellness policies: employee participation in the development process and perceptions of the policies. J Child Nutr Manag. 2008;32:1.

Mersini E, Hyska J, Burazeri G. Evaluation of national food and nutrition policy in Albania. Zdravstveno Varstvo. 2017;56(2):115–23.

Cavagnero E, Daelmans B, Gupta N, Scherpbier R, Shankar A. Assessment of the health system and policy environment as a critical complement to tracking intervention coverage for maternal, newborn, and child health. Lancet. 2008;371(9620):1284–93.

Lehman WE, Greener JM, Simpson DD. Assessing organizational readiness for change. J Subst Abus Treat. 2002;22(4):197–209.

Pankratz M, Hallfors D, Cho H. Measuring perceptions of innovation adoption: the diffusion of a federal drug prevention policy. Health Educ Res. 2002;17(3):315–26.

Cook JM, Thompson R, Schnurr PP. Perceived characteristics of intervention scale: development and psychometric properties. Assessment. 2015;22(6):704–14.

Probart C, McDonnell ET, Jomaa L, Fekete V. Lessons from Pennsylvania’s mixed response to federal school wellness law. Health Aff. 2010;29(3):447–53.

Probart C, McDonnell E, Weirich JE, Schilling L, Fekete V. Statewide assessment of local wellness policies in Pennsylvania public school districts. J Am Diet Assoc. 2008;108(9):1497–502.

Rakic S, Novakovic B, Stevic S, Niskanovic J. Introduction of safety and quality standards for private health care providers: a case-study from the Republic of Srpska, Bosnia and Herzegovina. Int J Equity Health. 2018;17(1):92.

Rozema AD, Mathijssen JJP, Jansen MWJ, van Oers JAM. Sustainability of outdoor school ground smoking bans at secondary schools: a mixed-method study. Eur J Pub Health. 2018;28(1):43–9.

Barbero C, Moreland-Russell S, Bach LE, Cyr J. An evaluation of public school district tobacco policies in St. Louis County, Missouri. J Sch Health. 2013;83(8):525–32.

Williams KM, Kirsh S, Aron D, Au D, Helfrich C, Lambert-Kerzner A, et al. Evaluation of the Veterans Health Administration’s specialty care transformational initiatives to promote patient-centered delivery of specialty care: a mixed-methods approach. Telemed J E-Health. 2017;23(7):577–89.

Spencer E, Walshe K. National quality improvement policies and strategies in European healthcare systems. Quality Safety Health Care. 2009;18(Suppl 1):i22–i7.

Assunta M, Dorotheo EU. SEATCA Tobacco Industry Interference Index: a tool for measuring implementation of WHO Framework Convention on Tobacco Control Article 5.3. Tob Control. 2016;25(3):313–8.

Tummers L. Policy alienation of public professionals: the construct and its measurement. Public Adm Rev. 2012;72(4):516–25.

Tummers L, Bekkers V. Policy implementation, street-level bureaucracy, and the importance of discretion. Public Manag Rev. 2014;16(4):527–47.

Raghavan R, Bright CL, Shadoin AL. Toward a policy ecology of implementation of evidence-based practices in public mental health settings. Implement Sci. 2008;3:26.

Peters D, Harting J, van Oers H, Schuit J, de Vries N, Stronks K. Manifestations of integrated public health policy in Dutch municipalities. Health Promot Int. 2016;31(2):290–302.

Tosun J, Lang A. Policy integration: mapping the different concepts. Policy Studies. 2017;38(6):553–70.

Tubbing L, Harting J, Stronks K. Unravelling the concept of integrated public health policy: concept mapping with Dutch experts from science, policy, and practice. Health Policy. 2015;119(6):749–59.

Donkin A, Goldblatt P, Allen J, Nathanson V, Marmot M. Global action on the social determinants of health. BMJ Glob Health. 2017;3(Suppl 1):e000603-e.

Baum F, Friel S. Politics, policies and processes: a multidisciplinary and multimethods research programme on policies on the social determinants of health inequity in Australia. BMJ Open. 2017;7(12):e017772-e.

Delany T, Lawless A, Baum F, Popay J, Jones L, McDermott D, et al. Health in All Policies in South Australia: what has supported early implementation? Health Promot Int. 2016;31(4):888–98.

Valaitis R, MacDonald M, Kothari A, O'Mara L, Regan S, Garcia J, et al. Moving towards a new vision: implementation of a public health policy intervention. BMC Public Health. 2016;16:412.

Bennett LM, Gadlin H, Marchand C. Collaboration and team science: a field guide. Bethesda, MD: National Cancer Institute, National Institutes of Health; 2018. Contract No.: NIH Publication No. 18-7660.

Mazumdar M, Messinger S, Finkelstein DM, Goldberg JD, Lindsell CJ, Morton SC, et al. Evaluating academic scientists collaborating in team-based research: a proposed framework. Acad Med. 2015;90(10):1302–8.

Brownson RC, Fielding JE, Green LW. Building capacity for evidence-based public health: reconciling the pulls of practice and the push of research. Annu Rev Public Health. 2018;39:27–53.

Brownson RC, Colditz GA, Proctor EK. Future issues in dissemination and implementation research. In: Brownson RC, Colditz GA, Proctor EK, editors. Dissemination and Implementation Research in Health: Translating Science to Practice. Second Edition ed. New York: Oxford University Press; 2018.

Thomson K, Hillier-Brown F, Todd A, McNamara C, Huijts T, Bambra C. The effects of public health policies on health inequalities in high-income countries: an umbrella review. BMC Public Health. 2018;18(1):869.

Acknowledgements

The authors are grateful for the policy expertise and guidance of Alexandra Morshed and the administrative support of Mary Adams, Linda Dix, and Cheryl Valko at the Prevention Research Center, Brown School, Washington University in St. Louis. We thank Lori Siegel, librarian, Brown School, Washington University in St. Louis, for assistance with search terms and procedures. We appreciate the D&I contributions of Enola Proctor and Byron Powell at the Brown School, Washington University in St. Louis, that informed this review. We thank Russell Glasgow, University of Colorado Denver, for guidance on the overall review and pragmatic measure criteria.

Funding

This project was funded March 2019 through February 2020 by the Foundation for Barnes-Jewish Hospital, with support from the Washington University in St. Louis Institute of Clinical and Translational Science Pilot Program, NIH/National Center for Advancing Translational Sciences (NCATS) grant UL1 TR002345. The project was also supported by the National Cancer Institute P50CA244431, Cooperative Agreement number U48DP006395-01-00 from the Centers for Disease Control and Prevention, R01MH106510 from the National Institute of Mental Health, and the National Institute of Diabetes and Digestive and Kidney Diseases award number P30DK020579. The findings and conclusions in this paper are those of the authors and do not necessarily represent the official positions of the Foundation for Barnes-Jewish Hospital, Washington University in St. Louis Institute of Clinical and Translational Science, National Institutes of Health, or the Centers for Disease Control and Prevention.

Author information

Authors and affiliations

Prevention Research Center, Brown School, Washington University in St. Louis, One Brookings Drive, Campus Box 1196, St. Louis, MO, 63130, USA

Peg Allen, Meagan Pilar, Callie Walsh-Bailey, Stephanie Mazzucca, Maura M. Kepper & Ross C. Brownson

School of Social Work, Brigham Young University, 2190 FJSB, Provo, UT, 84602, USA

Cole Hooley

Kaiser Permanente Washington Health Research Institute, 1730 Minor Ave, Seattle, WA, 98101, USA

Cara C. Lewis, Kayne D. Mettert & Caitlin N. Dorsey

Department of Health Management & Policy, Drexel University Dornsife School of Public Health, Nesbitt Hall, 3215 Market St, Philadelphia, PA, 19104, USA

Jonathan Purtle

Brown School, Washington University in St. Louis, One Brookings Drive, Campus Box 1196, St. Louis, MO, 63130, USA

Ana A. Baumann

Department of Surgery (Division of Public Health Sciences) and Alvin J. Siteman Cancer Center, Washington University School of Medicine, 4921 Parkview Place, Saint Louis, MO, 63110, USA

Ross C. Brownson

Contributions

Review methodology and quality assessment scale: CCL, KDM, CND. Eligibility criteria: PA, RCB, CND, KDM, SM, MP, JP. Search strings and terms: CH, PA, MP, with review by AB, RCB, CND, CCL, MMK, SM, KDM. Framework selection: PA, AB, CH, MP. Abstract screening: PA, CH, MMK, SM, MP. Full-text screening: PA, CH, MP. Pilot extraction: PA, CND, CH, KDM, SM, MP. Data extraction: MP, CWB. Data aggregation: MP, CWB. Writing: PA, RCB, JP. Editing: RCB, JP, SM, AB, CND, CH, MMK, CCL, KDM, MP, CWB. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Peg Allen .

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1: Table S1. PRISMA checklist. Table S2. Electronic search terms for databases searched through EBSCO. Table S3. Electronic search terms for searches conducted through ProQuest. Table S4. PAPERS pragmatic rating scales. Table S5. PAPERS psychometric rating scales.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Allen, P., Pilar, M., Walsh-Bailey, C. et al. Quantitative measures of health policy implementation determinants and outcomes: a systematic review. Implementation Sci 15 , 47 (2020). https://doi.org/10.1186/s13012-020-01007-w

Received : 24 March 2020

Accepted : 05 June 2020

Published : 19 June 2020

DOI : https://doi.org/10.1186/s13012-020-01007-w

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Keywords

  • Implementation science
  • Health policy
  • Policy implementation
  • Implementation
  • Public policy
  • Psychometric


Is Quantitative Research Ethical? Tools for Ethically Practicing, Evaluating, and Using Quantitative Research

  • Original Paper
  • Published: 28 April 2017
  • Volume 143, pages 1–16 (2017)


Michael J. Zyphur & Dean C. Pierides


This editorial offers new ways to ethically practice, evaluate, and use quantitative research (QR). Our central claim is that ready-made formulas for QR, including ‘best practices’ and common notions of ‘validity’ or ‘objectivity,’ are often divorced from the ethical and practical implications of doing, evaluating, and using QR for specific purposes. To focus on these implications, we critique common theoretical foundations for QR and then recommend approaches to QR that are ‘built for purpose,’ by which we mean designed to ethically address specific problems or situations on terms that are contextually relevant. For this, we propose a new tool for evaluating the quality of QR, which we call ‘relational validity.’ Studies, including their methods and results, are relationally valid when they ethically connect researchers’ purposes with the way that QR is oriented and the ways that it is done—including the concepts and units of analysis invoked, as well as what its ‘methods’ imply more generally. This new way of doing QR can provide the liberty required to address serious worldly problems on terms that are both practical and ethically informed in relation to the problems themselves rather than the confines of existing QR logics and practices.



*Michalos, A. C. (1988). Editorial. Journal of Business Ethics, 1, 1.

Misangyi, V. F., Greckhamer, T., Furnari, S., Fiss, P. C., Crilly, D., & Aguilera, R. (2017). Embracing causal complexity the emergence of a neo-configurational perspective. Journal of Management, 43 (1), 255–282.

Morgan, G. (2006). Images of organization . Thousand Oaks: Sage.

OED Online. Oxford University Press, (June 2016). Retrieved June 10, 2016, from http://www.oxforddictionaries.com/definition/english/orient .

*Orlitzky, M., Louche, C., Gond, J. P., & Chapple, W. (2015). Unpacking the drivers of corporate social performance: A multilevel, multistakeholder, and multimethod analysis. Journal of Business Ethics . doi: 10.1007/s10551-015-2822-y .

*Painter-Morland, M. (2011). Rethinking responsible agency in corporations: Perspectives from Deleuze and Guattari. Journal of Business Ethics, 101 (1), 83–95.

Panter, A. T., & Sterba, S. K. (Eds.). (2011). Handbook of ethics in quantitative methodology . New York: Routledge.

Parkhurst, J. O., & Abeysinghe, S. (2016). What constitutes “good” evidence for public health and social policy-making? From hierarchies to appropriateness. Social Epistemology, 30 (5–6), 665–679.

Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in psychological science a crisis of confidence? Perspectives on Psychological Science, 7 (6), 528–530.

Pedhazur, E. J., & Schmelkin, L. P. (2013). Measurement, design, and analysis: An integrated approach . Washington, DC: Psychology Press.

*Prado, A. M., & Woodside, A. G. (2015). Deepening understanding of certification adoption and non-adoption of international-supplier ethical standards. Journal of Business Ethics, 132 (1), 105–125.

*Ralston, D. A., Egri, C. P., Furrer, O., Kuo, M. H., Li, Y., Wangenheim, F., et al. (2014). Societal-level versus individual-level predictions of ethical behavior: A 48-society study of collectivism and individualism. Journal of Business Ethics, 122 (2), 283–306.

*Rathner, S. (2013). The influence of primary study characteristics on the performance differential between socially responsible and conventional investment funds: A meta-analysis. Journal of Business Ethics, 118 (2), 349–363.

Rorty, R. (2009). Philosophy and the mirror of nature . Princeton: Princeton University Press.

Rose, N. (1985). The psychological complex . London: Routledge Kegan.

*Rousseau, D. M., Manning, J., & Denyer, D. (2008). Evidence in management and organizational science: Assembling the field’s full weight of scientific knowledge through syntheses. Academy of Management Annals, 2 (1), 475–515.

Russell, J., Greenhalgh, T., Byrne, E., & McDonnell, J. (2008). Recognizing rhetoric in health care policy analysis. Journal of Health Services Research and Policy, 13, 40–46.

Schön, D. A. (1992). The theory of inquiry: Dewey’s legacy to education. Curriculum Inquiry, 22 (2), 119–139.

Scott, J. C. (1998). Seeing like a state: How certain schemes to improve the human condition have failed . New Haven: Yale University Press.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference . New York: Wadsworth Cengage learning.

Shapin, S., & Schaffer, S. (1985). Leviathan and the air pump: Hobbes, Boyle and the experimental life . Princeton: Princeton University Press.

Singleton, V., & Law, J. (2013). Devices as rituals: Notes on enacting resistance. Journal of Cultural Economy, 6 (3), 259–277.

*Soares, C. (2003). Corporate versus individual moral responsibility. Journal of Business Ethics, 46 (2), 143–150.

Stone, D. A. (1989). Causal stories and the formation of policy agendas. Political Science Quarterly, 104 (2), 281–300.

Tuck, E., & McKenzie, M. (2015). Relational validity and the “where” of inquiry: Place and land in qualitative research. Qualitative Inquiry, 21 (7), 633–638.

Turker, D. (2009). Measuring corporate social responsibility: A scale development study. Journal of business ethics, 85 (4), 411–427.

Wasserman, L. (2013). All of statistics: A concise course in statistical inference . New York: Springer.

Werhane, P. H., & Freeman, R. E. (1999). Business ethics: The state of the art. International Journal of Management Reviews, 1 (1), 1–16.

Wicks, A. C., & Freeman, R. E. (1998). Organizational studies and the new pragmatism: Positivism, anti-positivism, and the search for ethics. Organization Science, 9, 123–140.

Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data . Cambridge: MIT press.

Young, I. M. (2011). Justice and the politics of difference . Princeton: Princeton University Press.

Zyphur, M. J., Pierides, D. C., & Roffe, J. (2016a). Measurement and statistics in ‘organization science’: Philosophical, sociological, and historical perspectives. In R. Mir, H. Willmott, & M. Greenwood (Eds.), The Routledge companion to philosophy in organization studies (pp. 474–482). Abingdon: Routledge.

Zyphur, M. J., Zammuto, R. F., & Zhang, Z. (2016b). Multilevel latent polynomial regression for modeling (in) congruence across organizational groups: The case of organizational culture research. Organizational Research Methods, 19 (1), 53–79.

Download references

Acknowledgements

This research was supported by Australian Research Council’s Future Fellowship scheme (project FT140100629).

Author information

Authors and affiliations.

Department of Management and Marketing, University of Melbourne, Parkville, VIC, 3010, Australia

Michael J. Zyphur

Alliance Manchester Business School, University of Manchester, Manchester, UK

Dean C. Pierides


Corresponding author

Correspondence to Michael J. Zyphur .

Typical regression methods minimize the residual variance of outcome variables by predicting the mean (or statistical ‘expectation’) of an outcome. This can be shown by a simple regression model as follows:

\[ y_{i} = a + \beta x_{i} + e_{i} \]

wherein \(y_{i}\) is an outcome for some unit i , \(a\) is a regression intercept, \(\beta\) is a slope linking a predictor \(x_{i}\) to the outcome, and \(e_{i}\) is a residual. Typical regression assumptions pertain to \(e\) because this is parameterized as a random variable for estimation and inference, typically with a normal distribution such that:

\[ e_{i} \sim N\left(0, \sigma^{2}\right) \]

wherein the residual variable has zero mean and variance \(\sigma^{2}\) .

However, if the outcome variable \(y\) is parameterized using the regression equation, the prediction of the outcome enters as the variable’s average. Specifically:

\[ y_{i} \sim N\left(a + \beta x_{i}, \sigma^{2}\right) \]

wherein all terms are as before, but writing the model this way makes clear that what is predicted are the average levels of the outcome \(y\) at different values of the predictor \(x\) .

The implication is that most regression methods implicitly assume that predicting averages is what is of greatest interest to researchers. With a focus on reducing errors in inference, predicting averages is probabilistically the best strategy, but only to the extent that what is desired is a single numerical prediction for an assumedly homogeneous group, based on that group’s average standing on the outcome \(y\) at a specific value of the predictor \(x\) . However, whether (and to what extent) averages are relevant for a specific purpose and research orientation is typically left unclarified in QR, and we propose that this should be examined on a case-by-case basis with an eye to the ethics of this and other QR practices.
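As a concrete illustration of this point, the following Python sketch (our own simulated example; the data and coefficient values are invented) fits an ordinary least squares line and confirms that its fitted values track the average of the outcome at each level of the predictor:

    # Minimal sketch: OLS fitted values equal model-implied averages of y at each x
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.repeat([1.0, 2.0, 3.0], 200)            # three levels of a predictor
    y = 0.5 + 2.0 * x + rng.normal(0, 1, x.size)   # y = a + b*x + e, with e ~ N(0, 1)

    b, a = np.polyfit(x, y, 1)                     # OLS slope and intercept

    for level in [1.0, 2.0, 3.0]:
        fitted = a + b * level                     # regression prediction at this x
        group_mean = y[x == level].mean()          # observed average of y at this x
        print(f"x={level}: fitted={fitted:.3f}, group mean={group_mean:.3f}")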


About this article

Zyphur, M.J., Pierides, D.C. Is Quantitative Research Ethical? Tools for Ethically Practicing, Evaluating, and Using Quantitative Research. J Bus Ethics 143 , 1–16 (2017). https://doi.org/10.1007/s10551-017-3549-8


Received : 14 February 2017

Accepted : 17 April 2017

Published : 28 April 2017

Issue Date : June 2017

DOI : https://doi.org/10.1007/s10551-017-3549-8

Keywords

  • Quantitative research
  • Quantitative methods
  • Probability
  • Research design
  • Data analysis
  • Inductive inference

How to appraise quantitative research

Volume 21, Issue 4

This article has a correction. Please see:

  • Correction: How to appraise quantitative research - April 01, 2019


  • Xabi Cathala 1 ,
  • Calvin Moorley 2
  • 1 Institute of Vocational Learning , School of Health and Social Care, London South Bank University , London , UK
  • 2 Nursing Research and Diversity in Care , School of Health and Social Care, London South Bank University , London , UK
  • Correspondence to Mr Xabi Cathala, Institute of Vocational Learning, School of Health and Social Care, London South Bank University London UK ; cathalax{at}lsbu.ac.uk and Dr Calvin Moorley, Nursing Research and Diversity in Care, School of Health and Social Care, London South Bank University, London SE1 0AA, UK; Moorleyc{at}lsbu.ac.uk

https://doi.org/10.1136/eb-2018-102996


Introduction

Some nurses feel that they lack the necessary skills to read a research paper and to then decide if they should implement the findings into their practice. This is particularly the case when considering the results of quantitative research, which often contains the results of statistical testing. However, nurses have a professional responsibility to critique research to improve their practice, care and patient safety. 1  This article provides a step-by-step guide on how to critically appraise a quantitative paper.

Title, keywords and the authors

The authors’ names may not mean much, but knowing the following will be helpful:

Their position, for example, academic, researcher or healthcare practitioner.

Their qualification, both professional, for example, a nurse or physiotherapist and academic (eg, degree, masters, doctorate).

This can indicate how the research has been conducted and the authors’ competence on the subject. Basically, do you want to read a paper on quantum physics written by a plumber?

The abstract is a summary of the article and should contain:

Introduction.

Research question/hypothesis.

Methods including sample design, tests used and the statistical analysis (of course! Remember we love numbers).

Main findings.

Conclusion.

The subheadings in the abstract will vary depending on the journal. An abstract should not usually be more than 300 words but this varies depending on specific journal requirements. If the above information is contained in the abstract, it can give you an idea about whether the study is relevant to your area of practice. However, before deciding if the results of a research paper are relevant to your practice, it is important to review the overall quality of the article. This can only be done by reading and critically appraising the entire article.

The introduction

The introduction should state the research question and, where appropriate, the hypothesis the study sets out to test. Example: the effect of paracetamol on levels of pain.

My hypothesis is that A has an effect on B, for example, paracetamol has an effect on levels of pain.

My null hypothesis is that A has no effect on B, for example, paracetamol has no effect on pain.

My study will test the null hypothesis. If the null hypothesis is rejected, the hypothesis is supported (A has an effect on B): the study provides evidence that paracetamol has an effect on the level of pain. If the null hypothesis cannot be rejected, the hypothesis is not supported (no effect of A on B has been demonstrated): the study provides no evidence that paracetamol has an effect on the level of pain.

Background/literature review

The literature review should include reference to recent and relevant research in the area. It should summarise what is already known about the topic and why the research study is needed and state what the study will contribute to new knowledge. 5 The literature review should be up to date, usually 5–8 years, but it will depend on the topic and sometimes it is acceptable to include older (seminal) studies.

Methodology

In quantitative studies, the data analysis varies between studies depending on the type of design used. For example, descriptive, correlational and experimental studies all differ in their analytical approach. A descriptive study describes the pattern of a topic in relation to one or more variables. 6 A correlational study examines the link (correlation) between two variables 7  and focuses on how one variable varies in relation to another. In experimental studies, the researchers manipulate variables and examine outcomes, 8  and participants are commonly assigned at random to different groups (randomisation) to determine the causal effect of a condition (the independent variable) on a certain outcome. This is a common method used in clinical trials.

There should be sufficient detail provided in the methods section for you to replicate the study (should you want to). To enable you to do this, the following sections are normally included:

Overview and rationale for the methodology.

Participants or sample.

Data collection tools.

Methods of data analysis.

Ethical issues.

Data collection should be clearly explained and the article should discuss how this process was undertaken. Data collection should be systematic, objective, precise, repeatable, valid and reliable. Any tool (eg, a questionnaire) used for data collection should have been piloted (or pretested and/or adjusted) to ensure the quality, validity and reliability of the tool. 9 The participants (the sample) and any randomisation technique used should be identified. The sample size is central in quantitative research, as the findings should be able to be generalised to the wider population. 10 The data analysis can be done manually, or more complex analyses can be performed using computer software, sometimes with the advice of a statistician. From this analysis, results such as the mode, mean, median, p value and CI are presented in numerical format.

The author(s) should present the results clearly. These may be presented in graphs, charts or tables alongside some text. You should perform your own critique of the data analysis process; just because a paper has been published, it does not mean it is perfect. Your findings may differ from the authors’. Through critical analysis, the reader may find an error in the study process that the authors have not seen or highlighted. Such errors can change the study’s results, or reveal that a study you thought was strong is in fact weak. To help you critique a quantitative research paper, some guidance on understanding statistical terminology is provided in  table 1 .

Table 1 Some basic guidance for understanding statistics

Quantitative studies examine the relationship between variables, and the p value expresses the strength of the evidence objectively. 11  By convention, if the p value is less than 0.05, the null hypothesis is rejected and the study will report a statistically significant difference. If the p value is 0.05 or greater, the null hypothesis is not rejected and the study will report no statistically significant difference.

The CI is usually reported at the 95% level and indicates the precision of a result. 12  A 95% CI gives the range of values within which we can be 95% confident that the true population value lies; it is calculated from the sample estimate and its standard error, not from the p value. The two are nonetheless linked: if a 95% CI for a difference excludes the value indicating no effect (eg, 0 for a difference, 1 for a ratio), the result is statistically significant at the 0.05 level. Together, p values and CIs indicate the confidence and robustness of a result.
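To see how these quantities are actually obtained, the following Python sketch computes a p value and a 95% CI for a two-group comparison using SciPy. The paracetamol framing echoes the example earlier in this article, but all numbers are simulated; note that the CI is built from the standard error and the confidence level, not from the p value:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    treatment = rng.normal(52, 10, 40)   # hypothetical pain scores with paracetamol
    control = rng.normal(58, 10, 40)     # hypothetical pain scores with placebo

    t_stat, p_value = stats.ttest_ind(treatment, control)   # pooled-variance t test

    n1, n2 = treatment.size, control.size
    diff = treatment.mean() - control.mean()
    pooled_var = ((n1 - 1) * treatment.var(ddof=1)
                  + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)             # 95% confidence level
    ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

    print(f"p value: {p_value:.4f}")
    print(f"95% CI for the mean difference: ({ci_low:.2f}, {ci_high:.2f})")
    # The CI excludes 0 exactly when p < 0.05 for this test.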

Discussion, recommendations and conclusion

The final section of the paper is where the authors discuss their results and link them to other literature in the area (some of which may have been included in the literature review at the start of the paper). This reminds the reader of what is already known, what the study has found and what new information it adds. The discussion should demonstrate how the authors interpreted their results and how they contribute to new knowledge in the area. Implications for practice and future research should also be highlighted in this section of the paper.

A few other areas you may find helpful are:

Limitations of the study.

Conflicts of interest.

Table 2 provides a useful tool to help you apply the learning in this paper to the critiquing of quantitative research papers.

Table 2 Quantitative paper appraisal checklist

  • 1. Nursing and Midwifery Council, 2015. The code: standard of conduct, performance and ethics for nurses and midwives. https://www.nmc.org.uk/globalassets/sitedocuments/nmc-publications/nmc-code.pdf (accessed 21 Aug 2018).

Competing interests None declared.

Patient consent Not required.

Provenance and peer review Commissioned; internally peer reviewed.

Correction notice This article has been updated since its original publication to update p values from 0.5 to 0.05 throughout.

Linked Articles

  • Miscellaneous Correction: How to appraise quantitative research BMJ Publishing Group Ltd and RCN Publishing Company Ltd Evidence-Based Nursing 2019; 22 62-62 Published Online First: 31 Jan 2019. doi: 10.1136/eb-2018-102996corr1


Descriptive Statistics for Summarising Data

Ray W. Cooksey

UNE Business School, University of New England, Armidale, NSW Australia

This chapter discusses and illustrates descriptive statistics . The purpose of the procedures and fundamental concepts reviewed in this chapter is quite straightforward: to facilitate the description and summarisation of data. By ‘describe’ we generally mean either the use of some pictorial or graphical representation of the data (e.g. a histogram, box plot, radar plot, stem-and-leaf display, icon plot or line graph) or the computation of an index or number designed to summarise a specific characteristic of a variable or measurement (e.g., frequency counts, measures of central tendency, variability, standard scores). Along the way, we explore the fundamental concepts of probability and the normal distribution. We seldom interpret individual data points or observations primarily because it is too difficult for the human brain to extract or identify the essential nature, patterns, or trends evident in the data, particularly if the sample is large. Rather we utilise procedures and measures which provide a general depiction of how the data are behaving. These statistical procedures are designed to identify or display specific patterns or trends in the data. What remains after their application is simply for us to interpret and tell the story.

The first broad category of statistics we discuss concerns descriptive statistics.

Reflect on the QCI research scenario and the associated data set discussed in Chap. 10.1007/978-981-15-2537-7_4. Consider the following questions that Maree might wish to address with respect to decision accuracy and speed scores:

  • What was the typical level of accuracy and decision speed for inspectors in the sample? [see Procedure 5.4 – Assessing central tendency.]
  • What was the most common accuracy and speed score amongst the inspectors? [see Procedure 5.4 – Assessing central tendency.]
  • What was the range of accuracy and speed scores; the lowest and the highest scores? [see Procedure 5.5 – Assessing variability.]
  • How frequently were different levels of inspection accuracy and speed observed? What was the shape of the distribution of inspection accuracy and speed scores? [see Procedure 5.1 – Frequency tabulation, distributions & crosstabulation.]
  • What percentage of inspectors would have ‘failed’ to ‘make the cut’ assuming the industry standard for acceptable inspection accuracy and speed combined was set at 95%? [see Procedure 5.7 – Standard ( z ) scores.]
  • How variable were the inspectors in their accuracy and speed scores? Were all the accuracy and speed levels relatively close to each other in magnitude or were the scores widely spread out over the range of possible test outcomes? [see Procedure 5.5 – Assessing variability.]
  • What patterns might be visually detected when looking at various QCI variables singly and together as a set? [see Procedure 5.2 – Graphical methods for displaying data, Procedure 5.3 – Multivariate graphs & displays, and Procedure 5.6 – Exploratory data analysis.]

This chapter includes discussions and illustrations of a number of procedures available for answering questions about data like those posed above. In addition, you will find discussions of two fundamental concepts, namely probability and the normal distribution ; concepts that provide building blocks for Chaps. 10.1007/978-981-15-2537-7_6 and 10.1007/978-981-15-2537-7_7.

Procedure 5.1: Frequency Tabulation, Distributions & Crosstabulation

Frequency tabulation and distributions.

Frequency tabulation serves to provide a convenient counting summary for a set of data that facilitates interpretation of various aspects of those data. Basically, frequency tabulation occurs in two stages:

  • First, the scores in a set of data are rank ordered from the lowest value to the highest value.
  • Second, the number of times each specific score occurs in the sample is counted. This count records the frequency of occurrence for that specific data value.

Consider the overall job satisfaction variable, jobsat , from the QCI data scenario. Performing frequency tabulation across the 112 Quality Control Inspectors on this variable using the SPSS Frequencies procedure (Allen et al. 2019 , ch. 3; George and Mallery 2019 , ch. 6) produces the frequency tabulation shown in Table 5.1 . Note that three of the inspectors in the sample did not provide a rating for jobsat thereby producing three missing values (= 2.7% of the sample of 112) and leaving 109 inspectors with valid data for the analysis.

Table 5.1 Frequency tabulation of overall job satisfaction scores

The display of frequency tabulation is often referred to as the frequency distribution for the sample of scores. For each value of a variable, the frequency of its occurrence in the sample of data is reported. It is possible to compute various percentages and percentile values from a frequency distribution.

Table 5.1 shows the ‘Percent’ or relative frequency of each score (the percentage of the 112 inspectors obtaining each score, including those inspectors who were missing scores, which SPSS labels as ‘System’ missing). Table 5.1 also shows the ‘Valid Percent’ which is computed only for those inspectors in the sample who gave a valid or non-missing response.

Finally, it is possible to add up the ‘Valid Percent’ values, starting at the low score end of the distribution, to form the cumulative distribution or ‘Cumulative Percent’ . A cumulative distribution is useful for finding percentiles which reflect what percentage of the sample scored at a specific value or below.

We can see in Table 5.1 that 4 of the 109 valid inspectors (a ‘Valid Percent’ of 3.7%) indicated the lowest possible level of job satisfaction—a value of 1 (Very Low) – whereas 18 of the 109 valid inspectors (a ‘Valid Percent’ of 16.5%) indicated the highest possible level of job satisfaction—a value of 7 (Very High). The ‘Cumulative Percent’ number of 18.3 in the row for the job satisfaction score of 3 can be interpreted as “roughly 18% of the sample of inspectors reported a job satisfaction score of 3 or less”; that is, nearly a fifth of the sample expressed some degree of negative satisfaction with their job as a quality control inspector in their particular company.

If you have a large data set having many different scores for a particular variable, it may be more useful to tabulate frequencies on the basis of intervals of scores.

For the accuracy scores in the QCI database, you could count scores occurring in intervals such as ‘less than 75% accuracy’, ‘between 75% but less than 85% accuracy’, ‘between 85% but less than 95% accuracy’, and ‘95% accuracy or greater’, rather than counting the individual scores themselves. This would yield what is termed a ‘grouped’ frequency distribution since the data have been grouped into intervals or score classes. Producing such an analysis using SPSS would involve extra steps to create the new category or ‘grouping’ system for scores prior to conducting the frequency tabulation.
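The chapter’s examples use SPSS; as a rough open-source equivalent, the following pandas sketch builds both an ungrouped and a grouped frequency table. The data are simulated and the jobsat and accuracy names merely mirror the QCI examples:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    jobsat = pd.Series(rng.integers(1, 8, size=112)).astype(float)
    jobsat.iloc[[5, 40, 90]] = np.nan                  # three missing ratings

    counts = jobsat.value_counts().sort_index()        # rank-order scores, then count
    table = pd.DataFrame({
        "Frequency": counts,
        "Percent": 100 * counts / len(jobsat),           # missing included in the base
        "Valid Percent": 100 * counts / jobsat.count(),  # valid responses only
    })
    table["Cumulative Percent"] = table["Valid Percent"].cumsum()
    print(table)

    # A 'grouped' frequency distribution counts scores within intervals instead
    accuracy = pd.Series(rng.uniform(55, 100, size=112))
    bins = [0, 75, 85, 95, 100]
    print(pd.cut(accuracy, bins=bins, right=False).value_counts().sort_index())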

Crosstabulation

In a frequency crosstabulation , we count frequencies on the basis of two variables simultaneously rather than one; thus we have a bivariate situation.

For example, Maree might be interested in the number of male and female inspectors in the sample of 112 who obtained each jobsat score. Here there are two variables to consider: inspector’s gender and inspector’s jobsat score. Table 5.2 shows such a crosstabulation as compiled by the SPSS Crosstabs procedure (George and Mallery 2019 , ch. 8). Note that inspectors who did not report a score for jobsat and/or gender have been omitted as missing values, leaving 106 valid inspectors for the analysis.

Table 5.2 Frequency crosstabulation of jobsat scores by gender category for the QCI data

The crosstabulation shown in Table 5.2 gives a composite picture of the distribution of satisfaction levels for male inspectors and for female inspectors. If frequencies or ‘Counts’ are added across the gender categories, we obtain the numbers in the ‘Total’ column (the percentages or relative frequencies are also shown immediately below each count) for each discrete value of jobsat (note this column of statistics differs from that in Table 5.1 because the gender variable was missing for certain inspectors). By adding down each gender column, we obtain, in the bottom row labelled ‘Total’, the number of males and the number of females that comprised the sample of 106 valid inspectors.

The totals, either across the rows or down the columns of the crosstabulation, are termed the marginal distributions of the table. These marginal distributions are equivalent to frequency tabulations for each of the variables jobsat and gender . As with frequency tabulation, various percentage measures can be computed in a crosstabulation, including the percentage of the sample associated with a specific count within either a row (‘% within jobsat ’) or a column (‘% within gender ’). You can see in Table 5.2 that 18 inspectors indicated a job satisfaction level of 7 (Very High); of these 18 inspectors reported in the ‘Total’ column, 8 (44.4%) were male and 10 (55.6%) were female. The marginal distribution for gender in the ‘Total’ row shows that 57 inspectors (53.8% of the 106 valid inspectors) were male and 49 inspectors (46.2%) were female. Of the 57 male inspectors in the sample, 8 (14.0%) indicated a job satisfaction level of 7 (Very High). Furthermore, we could generate some additional interpretive information of value by adding the ‘% within gender’ values for job satisfaction levels of 5, 6 and 7 (i.e. differing degrees of positive job satisfaction). Here we would find that 68.4% (= 24.6% + 29.8% + 14.0%) of male inspectors indicated some degree of positive job satisfaction compared to 61.2% (= 10.2% + 30.6% + 20.4%) of female inspectors.

This helps to build a picture of the possible relationship between an inspector’s gender and their level of job satisfaction (a relationship that, as we will see later, can be quantified and tested using Procedure 10.1007/978-981-15-2537-7_6#Sec14 and Procedure 10.1007/978-981-15-2537-7_7#Sec17).

It should be noted that a crosstabulation table such as that shown in Table 5.2 is often referred to as a contingency table about which more will be said later (see Procedure 10.1007/978-981-15-2537-7_7#Sec17 and Procedure 10.1007/978-981-15-2537-7_7#Sec115).
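A comparable crosstabulation can be produced outside SPSS. The following pandas sketch (simulated data, with gender and jobsat as stand-in variables) computes the counts, the marginal distributions and the ‘% within gender’ percentages:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        "gender": rng.choice(["male", "female"], size=106),
        "jobsat": rng.integers(1, 8, size=106),
    })

    # Counts, with the marginal distributions in the 'All' row and column
    print(pd.crosstab(df["jobsat"], df["gender"], margins=True))

    # '% within gender': each column sums to 100%
    print(pd.crosstab(df["jobsat"], df["gender"], normalize="columns") * 100)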

Frequency tabulation is useful for providing convenient data summaries which can aid in interpreting trends in a sample, particularly where the number of discrete values for a variable is relatively small. A cumulative percent distribution provides additional interpretive information about the relative positioning of specific scores within the overall distribution for the sample.

Crosstabulation permits the simultaneous examination of the distributions of values for two variables obtained from the same sample of observations. This examination can yield some useful information about the possible relationship between the two variables. More complex crosstabulations can also be done where the values of three or more variables are tracked in a single systematic summary. The use of frequency tabulation or crosstabulation in conjunction with various other statistical measures, such as measures of central tendency (see Procedure 5.4 ) and measures of variability (see Procedure 5.5 ), can provide a relatively complete descriptive summary of any data set.

Disadvantages

Frequency tabulations can get messy if interval or ratio-level measures are tabulated simply because of the large number of possible data values. Grouped frequency distributions really should be used in such cases. However, certain choices, such as the size of the score interval (group size), must be made, often arbitrarily, and such choices can affect the nature of the final frequency distribution.

Additionally, percentage measures have certain problems associated with them, most notably, the potential for their misinterpretation in small samples. One should be sure to know the sample size on which percentage measures are based in order to obtain an interpretive reference point for the actual percentage values.

For example

In a sample of 10 individuals, 20% represents only two individuals whereas in a sample of 300 individuals, 20% represents 60 individuals. If all that is reported is the 20%, then the mental inference drawn by readers is likely to be that a sizeable number of individuals had a score or scores of a particular value—but what is ‘sizeable’ depends upon the total number of observations on which the percentage is based.

Where Is This Procedure Useful?

Frequency tabulation and crosstabulation are very commonly applied procedures used to summarise information from questionnaires, both in terms of tabulating various demographic characteristics (e.g. gender, age, education level, occupation) and in terms of actual responses to questions (e.g. numbers responding ‘yes’ or ‘no’ to a particular question). They can be particularly useful in helping to build up the data screening and demographic stories discussed in Chap. 10.1007/978-981-15-2537-7_4. Categorical data from observational studies can also be analysed with this technique (e.g. the number of times Suzy talks to Frank, to Billy, and to John in a study of children’s social interactions).

Certain types of experimental research designs may also be amenable to analysis by crosstabulation with a view to drawing inferences about distribution differences across the sets of categories for the two variables being tracked.

You could employ crosstabulation in conjunction with the tests described in Procedure 10.1007/978-981-15-2537-7_7#Sec17 to see if two different styles of advertising campaign differentially affect the product purchasing patterns of male and female consumers.

In the QCI database, Maree could employ crosstabulation to help her answer the question “do different types of electronic manufacturing firms ( company ) differ in terms of their tendency to employ male versus female quality control inspectors ( gender )?”

Software Procedures

Procedure 5.2: Graphical Methods for Displaying Data

Graphical methods for displaying data include bar and pie charts, histograms and frequency polygons, line graphs and scatterplots. It is important to note that what is presented here is a small but representative sampling of the types of simple graphs one can produce to summarise and display trends in data. Generally speaking, SPSS offers the easiest facility for producing and editing graphs, but with a rather limited range of styles and types. SYSTAT, STATGRAPHICS and NCSS offer a much wider range of graphs (including graphs unique to each package), but with the drawback that it takes somewhat more effort to get the graphs in exactly the form you want.

Bar and Pie Charts

These two types of graphs are useful for summarising the frequency of occurrence of various values (or ranges of values) where the data are categorical (nominal or ordinal level of measurement).

  • A bar chart uses vertical and horizontal axes to summarise the data. The vertical axis is used to represent frequency (number) of occurrence or the relative frequency (percentage) of occurrence; the horizontal axis is used to indicate the data categories of interest.
  • A pie chart gives a simpler visual representation of category frequencies by cutting a circular plot into wedges or slices whose sizes are proportional to the relative frequency (percentage) of occurrence of specific data categories. A pie chart can have one or more slices emphasised by ‘exploding’ them out from the rest of the pie.
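As a rough illustration of these two chart types outside the packages named above, here is a minimal matplotlib sketch; the company labels and percentages are invented stand-ins for the QCI data:

    import matplotlib.pyplot as plt

    companies = ["PC", "Large computer", "Small appliance",
                 "Large appliance", "Automobile"]
    pct_female = [18, 12, 26, 24, 20]   # hypothetical relative frequencies

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.bar(companies, pct_female)                 # categories along the x-axis
    ax1.set_ylabel("Percentage of female inspectors")
    ax1.tick_params(axis="x", rotation=45)

    # 'explode' pulls a slice out to emphasise it
    ax2.pie(pct_female, labels=companies, autopct="%.1f%%",
            explode=[0, 0.1, 0, 0, 0])
    plt.tight_layout()
    plt.show()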

Consider the company variable from the QCI database. This variable depicts the types of manufacturing firms that the quality control inspectors worked for. Figure 5.1 illustrates a bar chart summarising the percentage of female inspectors in the sample coming from each type of firm. Figure 5.2 shows a pie chart representation of the same data, with an ‘exploded slice’ highlighting the percentage of female inspectors in the sample who worked for large business computer manufacturers – the lowest percentage of the five types of companies. Both graphs were produced using SPSS.

Fig. 5.1 Bar chart: Percentage of female inspectors

Fig. 5.2 Pie chart: Percentage of female inspectors

The pie chart was modified with an option to show the actual percentage along with the label for each category. The bar chart shows that computer manufacturing firms have relatively fewer female inspectors compared to the automotive and electrical appliance (large and small) firms. This trend is less clear from the pie chart which suggests that pie charts may be less visually interpretable when the data categories occur with rather similar frequencies. However, the ‘exploded slice’ option can help interpretation in some circumstances.

Certain software programs, such as SPSS, STATGRAPHICS, NCSS and Microsoft Excel, offer the option of generating 3-dimensional bar charts and pie charts and incorporating other ‘bells and whistles’ that can potentially add visual richness to the graphic representation of the data. However, you should generally be careful with these fancier options as they can produce distortions and create ambiguities in interpretation (e.g. see discussions in Jacoby 1997 ; Smithson 2000 ; Wilkinson 2009 ). Such distortions and ambiguities could ultimately end up providing misinformation to researchers as well as to those who read their research.

Histograms and Frequency Polygons

These two types of graphs are useful for summarising the frequency of occurrence of various values (or ranges of values) where the data are essentially continuous (interval or ratio level of measurement) in nature. Both histograms and frequency polygons use vertical and horizontal axes to summarise the data. The vertical axis is used to represent the frequency (number) of occurrence or the relative frequency (percentage) of occurrences; the horizontal axis is used for the data values or ranges of values of interest. The histogram uses bars of varying heights to depict frequency; the frequency polygon uses lines and points.

There is a visual difference between a histogram and a bar chart: the bar chart uses bars that do not physically touch, signifying the discrete and categorical nature of the data, whereas the bars in a histogram physically touch to signal the potentially continuous nature of the data.

Suppose Maree wanted to graphically summarise the distribution of speed scores for the 112 inspectors in the QCI database. Figure 5.3 (produced using NCSS) illustrates a histogram representation of this variable. Figure 5.3 also illustrates another representational device called the ‘density plot’ (the solid tracing line overlaying the histogram) which gives a smoothed impression of the overall shape of the distribution of speed scores. Figure 5.4 (produced using STATGRAPHICS) illustrates the frequency polygon representation for the same data.

Fig. 5.3 Histogram of the speed variable (with density plot overlaid)

Fig. 5.4 Frequency polygon plot of the speed variable

These graphs employ a grouped format where speed scores which fall within specific intervals are counted as being essentially the same score. The shape of the data distribution is reflected in these plots. Each graph tells us that the inspection speed scores are positively skewed with only a few inspectors taking very long times to make their inspection judgments and the majority of inspectors taking rather shorter amounts of time to make their decisions.

Both representations tell a similar story; the choice between them is largely a matter of personal preference. However, if the number of bars to be plotted in a histogram is potentially very large (and this is usually directly controllable in most statistical software packages), then a frequency polygon would be the preferred representation simply because the amount of visual clutter in the graph will be much reduced.

It is somewhat of an art to choose an appropriate definition for the width of the score grouping intervals (or ‘bins’ as they are often termed) to be used in the plot: choose too many and the plot may look too lumpy and the overall distributional trend may not be obvious; choose too few and the plot will be too coarse to give a useful depiction. Programs like SPSS, SYSTAT, STATGRAPHICS and NCSS are designed to choose an ‘appropriate’ number of bins to be used, but the analyst’s eye is often a better judge than any statistical rule that a software package would use.
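The following matplotlib sketch (simulated, positively skewed data standing in for the speed scores) draws the same distribution with several bin counts so the effect of this choice can be seen directly, and overlays a smoothed density trace of the kind shown in Fig. 5.3:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(4)
    speed = rng.gamma(shape=2.0, scale=2.0, size=112)   # positively skewed scores

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, bins in zip(axes, [5, 15, 60]):             # too few, reasonable, too many
        ax.hist(speed, bins=bins, density=True, edgecolor="black")
        ax.set_title(f"{bins} bins")

    # Density trace over the middle panel, as in Fig. 5.3
    xs = np.linspace(speed.min(), speed.max(), 200)
    axes[1].plot(xs, stats.gaussian_kde(speed)(xs))
    plt.show()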

There are several interesting variations of the histogram which can highlight key data features or facilitate interpretation of certain trends in the data. One such variation is a graph called a dual histogram (available in SYSTAT; a variation called a ‘comparative histogram’ can be created in NCSS) – a graph that facilitates visual comparison of the frequency distributions for a specific variable for participants from two distinct groups.

Suppose Maree wanted to graphically compare the distributions of speed scores for inspectors in the two categories of education level ( educlev ) in the QCI database. Figure 5.5 shows a dual histogram (produced using SYSTAT) that accomplishes this goal. This graph still employs the grouped format where speed scores falling within particular intervals are counted as being essentially the same score. The shape of the data distribution within each group is also clearly reflected in this plot. However, the story conveyed by the dual histogram is that, while the inspection speed scores are positively skewed for inspectors in both categories of educlev, the comparison suggests that inspectors with a high school level of education (= 1) tend to take slightly longer to make their inspection decisions than do their colleagues who have a tertiary qualification (= 2).

Fig. 5.5 Dual histogram of speed for the two categories of educlev
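A comparable display can be improvised outside SYSTAT by stacking two histograms on a shared horizontal axis. A minimal matplotlib sketch, with simulated data standing in for the two educlev groups, follows:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    high_school = rng.gamma(2.0, 2.4, size=55)   # simulated: tends to be slower
    tertiary = rng.gamma(2.0, 2.0, size=57)

    fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    bins = np.linspace(0, 20, 21)                # common bins aid comparison
    ax1.hist(high_school, bins=bins)
    ax1.set_ylabel("High school")
    ax2.hist(tertiary, bins=bins)
    ax2.set_ylabel("Tertiary")
    ax2.set_xlabel("Inspection speed (seconds)")
    plt.show()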

Line Graphs

The line graph is similar in style to the frequency polygon but is much more general in its potential for summarising data. In a line graph, we seldom deal with percentage or frequency data. Instead we can summarise other types of information about data such as averages or means (see Procedure 5.4 for a discussion of this measure), often for different groups of participants. Thus, one important use of the line graph is to break down scores on a specific variable according to membership in the categories of a second variable.

In the context of the QCI database, Maree might wish to summarise the average inspection accuracy scores for the inspectors from different types of manufacturing companies. Figure 5.6 was produced using SPSS and shows such a line graph.

Fig. 5.6 Line graph comparison of companies in terms of average inspection accuracy

Note how the trend in performance across the different companies becomes clearer with such a visual representation. It appears that the inspectors from the Large Business Computer and PC manufacturing companies have better average inspection accuracy compared to the inspectors from the remaining three industries.

With many software packages, it is possible to further elaborate a line graph by including error or confidence interval bars (see Procedure 10.1007/978-981-15-2537-7_8#Sec18). These give some indication of the precision with which the average level for each category in the population has been estimated (narrow bars signal a more precise estimate; wide bars signal a less precise estimate).

Figure 5.7 shows such an elaborated line graph, using 95% confidence interval bars, which can be used to help make more defensible judgments (compared to Fig. 5.6 ) about whether the companies are substantively different from each other in average inspection performance. Companies whose confidence interval bars do not overlap each other can be inferred to be substantively different in performance characteristics.

Fig. 5.7 Line graph using confidence interval bars to compare accuracy across companies

The accuracy confidence interval bars for participants from the Large Business Computer manufacturing firms do not overlap those from the Large or Small Electrical Appliance manufacturers or the Automobile manufacturers.

We might conclude that quality control inspection accuracy is substantially better in the Large Business Computer manufacturing companies than in these other industries but is not substantially better than the PC manufacturing companies. We might also conclude that inspection accuracy in PC manufacturing companies is not substantially different from Small Electrical Appliance manufacturers.
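The following matplotlib sketch shows one way to build such a plot by hand, computing each group’s mean and the half-width of its 95% confidence interval; the data and company means are simulated, not the QCI values:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(6)
    companies = ["PC", "Large computer", "Small appliance",
                 "Large appliance", "Automobile"]
    samples = [rng.normal(mu, 8, 22) for mu in (88, 91, 82, 81, 80)]

    means = [s.mean() for s in samples]
    # Half-width of each 95% CI: t critical value times the standard error of the mean
    halfwidths = [stats.t.ppf(0.975, df=s.size - 1) * s.std(ddof=1) / np.sqrt(s.size)
                  for s in samples]

    fig, ax = plt.subplots()
    ax.errorbar(range(len(companies)), means, yerr=halfwidths, marker="o", capsize=4)
    ax.set_xticks(range(len(companies)))
    ax.set_xticklabels(companies, rotation=45)
    ax.set_ylabel("Mean inspection accuracy (%)")
    plt.tight_layout()
    plt.show()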

Scatterplots

Scatterplots are useful in displaying the relationship between two interval- or ratio-scaled variables or measures of interest obtained on the same individuals, particularly in correlational research (see Fundamental Concept 10.1007/978-981-15-2537-7_6#Sec1 and Procedure 10.1007/978-981-15-2537-7_6#Sec4).

In a scatterplot, one variable is chosen to be represented on the horizontal axis; the second variable is represented on the vertical axis. In this type of plot, all data point pairs in the sample are graphed. The shape and tilt of the cloud of points in a scatterplot provide visual information about the strength and direction of the relationship between the two variables. A very compact elliptical cloud of points signals a strong relationship; a very loose or nearly circular cloud signals a weak or non-existent relationship. A cloud of points generally tilted upward toward the right side of the graph signals a positive relationship (higher scores on one variable associated with higher scores on the other and vice-versa). A cloud of points generally tilted downward toward the right side of the graph signals a negative relationship (higher scores on one variable associated with lower scores on the other and vice-versa).

Maree might be interested in displaying the relationship between inspection accuracy and inspection speed in the QCI database. Figure 5.8 , produced using SPSS, shows what such a scatterplot might look like. Several characteristics of the data for these two variables can be noted in Fig. 5.8 . The shape of the distribution of data points is evident. The plot has a fan-shaped characteristic to it which indicates that accuracy scores are highly variable (exhibit a very wide range of possible scores) at very fast inspection speeds but get much less variable and tend to be somewhat higher as inspection speed increases (where inspectors take longer to make their quality control decisions). Thus, there does appear to be some relationship between inspection accuracy and inspection speed (a weak positive relationship, since the cloud of points tends to be very loose but tilted generally upward toward the right side of the graph – slower speeds tend to be slightly associated with higher accuracy).

Fig. 5.8 Scatterplot relating inspection accuracy to inspection speed

However, it is not the case that the inspection decisions which take longest to make are necessarily the most accurate (see the labelled points for inspectors 7 and 62 in Fig. 5.8 ). Thus, Fig. 5.8 does not show a simple relationship that can be unambiguously summarised by a statement like “the longer an inspector takes to make a quality control decision, the more accurate that decision is likely to be”. The story is more complicated.

Some software packages, such as SPSS, STATGRAPHICS and SYSTAT, offer the option of using different plotting symbols or markers to represent the members of different groups so that the relationship between the two focal variables (the ones anchoring the X and Y axes) can be clarified with reference to a third categorical measure.

Maree might want to see if the relationship depicted in Fig. 5.8 changes depending upon whether the inspector was tertiary-qualified or not (this information is represented in the educlev variable of the QCI database).

Figure 5.9 shows what such a modified scatterplot might look like; the legend in the upper corner of the figure defines the marker symbols for each category of the educlev variable. Note that for both High School only-educated inspectors and Tertiary-qualified inspectors, the general fan-shaped relationship between accuracy and speed is the same. However, it appears that the distribution of points for the High School only-educated inspectors is shifted somewhat upward and toward the right of the plot suggesting that these inspectors tend to be somewhat more accurate as well as slower in their decision processes.

Fig. 5.9 Scatterplot displaying accuracy vs speed conditional on educlev group
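A similar grouped scatterplot can be produced in matplotlib by looping over the groups and giving each its own marker. In this sketch the data are simulated to mimic the fan shape described above:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    speed = rng.gamma(2.0, 2.0, 112)                        # skewed inspection times
    noise = rng.normal(0, 18, 112) / np.sqrt(speed)         # more scatter at fast speeds
    accuracy = np.clip(75 + 2.0 * speed + noise, 0, 100)    # fan-shaped relationship
    educlev = rng.choice(["High school", "Tertiary"], 112)

    fig, ax = plt.subplots()
    for group, marker in [("High school", "o"), ("Tertiary", "^")]:
        mask = educlev == group
        ax.scatter(speed[mask], accuracy[mask], marker=marker, label=group)
    ax.set_xlabel("Inspection speed (seconds)")
    ax.set_ylabel("Inspection accuracy (%)")
    ax.legend()
    plt.show()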

There are many other styles of graphs available, often dependent upon the specific statistical package you are using. Interestingly, NCSS and, particularly, SYSTAT and STATGRAPHICS, appear to offer the most variety in terms of types of graphs available for visually representing data. A reading of the user’s manuals for these programs (see the Useful additional readings) would expose you to the great diversity of plotting techniques available to researchers. Many of these techniques go by rather interesting names such as: Chernoff’s faces, radar plots, sunflower plots, violin plots, star plots, Fourier blobs, and dot plots.

These graphical methods provide summary techniques for visually presenting certain characteristics of a set of data. Visual representations are generally easier to understand than a tabular representation and, when these plots are combined with available numerical statistics, they can give a very complete picture of a sample of data. Newer methods have become available which permit more complex representations to be depicted, opening possibilities for the creative visual representation of more aspects and features of the data (leading to a style of visual data storytelling called infographics ; see, for example, McCandless 2014 ; Toseland and Toseland 2012 ). Many of these newer methods can display data patterns from multiple variables in the same graph (several of these newer graphical methods are illustrated and discussed in Procedure 5.3 ).

Disadvantages

Graphs tend to be cumbersome and space consuming if a great many variables need to be summarised. In such cases, using numerical summary statistics (such as means or correlations) in tabular form alone will provide a more economical and efficient summary. Also, it can be very easy to give a misleading picture of data trends using graphical methods by simply choosing the ‘correct’ scaling for maximum effect or choosing a display option (such as a 3-D effect) that ‘looks’ presentable but which actually obscures a clear interpretation (see Smithson 2000 ; Wilkinson 2009 ).

Thus, you must be careful in creating and interpreting visual representations so that the influence of aesthetic choices for sake of appearance do not become more important than obtaining a faithful and valid representation of the data—a very real danger with many of today’s statistical packages where ‘default’ drawing options have been pre-programmed in. No single plot can completely summarise all possible characteristics of a sample of data. Thus, choosing a specific method of graphical display may, of necessity, force a behavioural researcher to represent certain data characteristics (such as frequency) at the expense of others (such as averages).

Where Is This Procedure Useful?

Virtually any research design which produces quantitative data and statistics (even to the extent of just counting the number of occurrences of several events) provides opportunities for graphical data display which may help to clarify or illustrate important data characteristics or relationships. Remember, graphical displays are communication tools just like numbers—which tool to choose depends upon the message to be conveyed. Visual representations of data are generally more useful in communicating to lay persons who are unfamiliar with statistics. Care must be taken though as these same lay people are precisely the people most likely to misinterpret a graph if it has been incorrectly drawn or scaled.

Procedure 5.3: Multivariate Graphs & Displays

Graphical methods for displaying multivariate data (i.e. many variables at once) include scatterplot matrices, radar (or spider) plots, multiplots, parallel coordinate displays, and icon plots. Multivariate graphs are useful for visualising broad trends and patterns across many variables (Cleveland 1995 ; Jacoby 1998 ). Such graphs typically sacrifice precision in representation in favour of a snapshot pictorial summary that can help you form general impressions of data patterns.

It is important to note that what is presented here is a small but reasonably representative sampling of the types of graphs one can produce to summarise and display trends in multivariate data. Generally speaking, SYSTAT offers the best facilities for producing multivariate graphs, followed by STATGRAPHICS, but with the drawback that it is somewhat tricky to get the graphs in exactly the form you want. SYSTAT also has excellent facilities for creating new forms and combinations of graphs – essentially allowing graphs to be tailor-made for a specific communication purpose. Both SPSS and NCSS offer a more limited range of multivariate graphs, generally restricted to scatterplot matrices and variations of multiplots. Microsoft Excel or STATGRAPHICS are the packages to use if radar or spider plots are desired.

Scatterplot Matrices

A scatterplot matrix is a useful multivariate graph designed to show relationships between pairs of many variables in the same display.

Figure 5.10 illustrates a scatterplot matrix, produced using SYSTAT, for the mentabil, accuracy, speed, jobsat and workcond variables in the QCI database. It is easy to see that all the scatterplot matrix does is stack all pairs of scatterplots into a format where it is easy to pick out the graph relating any 'row' variable to any 'column' variable.
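If you work in Python rather than SYSTAT, the same kind of display is easy to sketch. The snippet below is a minimal illustration only: the DataFrame is filled with random stand-in numbers, because the QCI scores themselves are not reproduced here.

```python
# A minimal scatterplot matrix sketch; the data are random stand-ins
# with the column names used in the text, not the actual QCI scores.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
cols = ["mentabil", "accuracy", "speed", "jobsat", "workcond"]
qci = pd.DataFrame(rng.normal(size=(112, 5)), columns=cols)

# diagonal="hist" puts a univariate histogram where each variable meets itself
pd.plotting.scatter_matrix(qci, diagonal="hist", figsize=(8, 8))
plt.show()
```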

[Fig. 5.10: Scatterplot matrix relating mentabil, accuracy, speed, jobsat & workcond]

In those plots where a 'row' variable intersects itself in a column of the matrix (along the so-called 'diagonal'), SYSTAT permits a range of univariate displays to be shown. Figure 5.10 shows univariate histograms for each variable (recall Procedure 5.2). One obvious drawback of the scatterplot matrix is that, if many variables are to be displayed (say, ten or more), the graph gets very crowded and becomes very hard to appreciate visually.

Looking at the first column of graphs in Fig. 5.10 , we can see the scatterplot relationships between mentabil and each of the other variables. We can get a visual impression that mentabil seems to be slightly negatively related to accuracy (the cloud of scatter points tends to angle downward to the right, suggesting, very slightly, that higher mentabil scores are associated with lower levels of accuracy ).

Conversely, the visual impression of the relationship between mentabil and speed is that the relationship is slightly positive (higher mentabil scores tend to be associated with higher speed scores = longer inspection times). Similar types of visual impressions can be formed for other parts of Fig. 5.10 . Notice that the histogram plots along the diagonal give a clear impression of the shape of the distribution for each variable.

Radar Plots

The radar plot (also known as a spider graph for obvious reasons) is a simple and effective device for displaying scores on many variables. Microsoft Excel offers a range of options and capabilities for producing radar plots, such as the plot shown in Fig. 5.11 . Radar plots are generally easy to interpret and provide a good visual basis for comparing plots from different individuals or groups, even if a fairly large number of variables (say, up to about 25) are being displayed. Like a clock face, variables are evenly spaced around the centre of the plot in clockwise order starting at the 12 o’clock position. Visual interpretation of a radar plot primarily relies on shape comparisons, i.e. the rise and fall of peaks and valleys along the spokes around the plot. Valleys near the centre display low scores on specific variables, peaks near the outside of the plot display high scores on specific variables. [Note that, technically, radar plots employ polar coordinates.] SYSTAT can draw graphs using polar coordinates but not as easily as Excel can, from the user’s perspective. Radar plots work best if all the variables represented are measured on the same scale (e.g. a 1 to 7 Likert-type scale or 0% to 100% scale). Individuals who are missing any scores on the variables being plotted are typically omitted.
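For those who prefer a programmatic route, the sketch below draws a comparable radar plot in matplotlib using polar coordinates, as the note above indicates. The nine variable labels and the two rating profiles are invented stand-ins, not the actual QCI values.

```python
# A minimal radar (spider) plot sketch using matplotlib's polar axes.
# Labels and scores below are invented stand-ins for the QCI attitude
# variables and for inspectors 66 and 104.
import numpy as np
import matplotlib.pyplot as plt

labels = [f"att{i}" for i in range(1, 10)]           # hypothetical variable names
ratings_66 = np.array([7, 6, 3, 6, 7, 6, 2, 7, 6])   # invented ratings
ratings_104 = np.array([1, 2, 2, 1, 1, 2, 2, 1, 1])

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.set_theta_zero_location("N")   # first variable at the 12 o'clock position
ax.set_theta_direction(-1)        # variables proceed clockwise, as in the text

for name, scores in [("inspector 66", ratings_66), ("inspector 104", ratings_104)]:
    closed = np.concatenate([scores, scores[:1]])     # close the polygon
    ax.plot(np.concatenate([angles, angles[:1]]), closed, label=name)

ax.set_xticks(angles)
ax.set_xticklabels(labels)
ax.set_ylim(0, 7)                 # assumes a 1-to-7 rating scale
ax.legend()
plt.show()
```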

[Fig. 5.11: Radar plot comparing attitude ratings for inspectors 66 and 104]

The radar plot in Fig. 5.11 , produced using Excel, compares two specific inspectors, 66 and 104, on the nine attitude rating scales. Inspector 66 gave the highest rating (= 7) on the cultqual variable and inspector 104 gave the lowest rating (= 1). The plot shows that inspector 104 tended to provide very low ratings on all nine attitude variables, whereas inspector 66 tended to give very high ratings on all variables except acctrain and trainapp , where the scores were similar to those for inspector 104. Thus, in general, inspector 66 tended to show much more positive attitudes toward their workplace compared to inspector 104.

While Fig. 5.11 was generated to compare the scores for two individuals in the QCI database, it would be just as easy to produce a radar plot that compared the five types of companies in terms of their average ratings on the nine variables, as shown in Fig. 5.12 .

[Fig. 5.12: Radar plot comparing average attitude ratings for five types of company]

Here we can form the visual impression that the five types of companies differ most in their average ratings of mgmtcomm and least in the average ratings of polsatis . Overall, the average ratings from inspectors from PC manufacturers (black diamonds with solid lines) seem to be generally the most positive as their scores lie on or near the outer ring of scores and those from Automobile manufacturers tend to be least positive on many variables (except the training-related variables).

Extrapolating from Fig. 5.12 , you may rightly conclude that including too many groups and/or too many variables in a radar plot comparison can lead to so much clutter that any visual comparison would be severely degraded. You may have to experiment with using colour-coded lines to represent different groups versus line and marker shape variations (as used in Fig. 5.12 ), because choice of coding method for groups can influence the interpretability of a radar plot.

Multiplots

A multiplot is simply a hybrid style of graph that can display group comparisons across a number of variables. There are a wide variety of possible multiplots one could potentially design (SYSTAT offers great capabilities with respect to multiplots). Figure 5.13 shows a multiplot comprising a side-by-side series of profile-based line graphs – one graph for each type of company in the QCI database.

[Fig. 5.13: Multiplot comparing profiles of average attitude ratings for five company types]

The multiplot in Fig. 5.13 , produced using SYSTAT, graphs the profile of average attitude ratings for all inspectors within a specific type of company. This multiplot shows the same story as the radar plot in Fig. 5.12 , but in a different graphical format. It is still fairly clear that the average ratings from inspectors from PC manufacturers tend to be higher than for the other types of companies and the profile for inspectors from automobile manufacturers tends to be lower than for the other types of companies.

The profile for inspectors from large electrical appliance manufacturers is the flattest, meaning that their average attitude ratings were less variable than for other types of companies. Comparing the ease with which you can glean the visual impressions from Figs. 5.12 and 5.13 may lead you to prefer one style of graph over another. If you have such preferences, chances are others will also, which may mean you need to carefully consider your options when deciding how best to display data for effect.

Frequently, choice of graph is less a matter of which style is right or wrong, but more a matter of which style will suit specific purposes or convey a specific story, i.e. the choice is often strategic.

Parallel Coordinate Displays

A parallel coordinate display is useful for displaying individual scores on a range of variables, all measured using the same scale. Furthermore, such graphs can be combined side-by-side to facilitate very broad visual comparisons among groups, while retaining individual profile variability in scores. Each line in a parallel coordinate display represents one individual, e.g. an inspector.

The interpretation of a parallel coordinate display, such as the two shown in Fig. 5.14 , depends on visual impressions of the peaks and valleys (highs and lows) in the profiles as well as on the density of similar profile lines. The graph is called ‘parallel coordinate’ simply because it assumes that all variables are measured on the same scale and that scores for each variable can therefore be located along vertical axes that are parallel to each other (imagine vertical lines on Fig. 5.14 running from bottom to top for each variable on the X-axis). The main drawback of this method of data display is that only those individuals in the sample who provided legitimate scores on all of the variables being plotted (i.e. who have no missing scores) can be displayed.
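A minimal Python sketch of this idea appears below, using pandas' built-in parallel_coordinates helper on invented stand-in ratings. Note the dropna() call, which enforces the complete-cases limitation just described.

```python
# A minimal parallel coordinate display sketch on invented ratings;
# dropna() removes cases with any missing score, as the text explains.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
df = pd.DataFrame(
    rng.integers(1, 8, size=(30, 5)).astype(float),
    columns=[f"att{i}" for i in range(1, 6)],        # hypothetical variable names
)
df.loc[rng.choice(30, size=4, replace=False), "att3"] = np.nan  # some missing scores
df["company"] = np.where(np.arange(30) < 15, "PC", "Automobile")

pd.plotting.parallel_coordinates(df.dropna(), class_column="company")
plt.show()
```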

[Fig. 5.14: Parallel coordinate displays comparing profiles of average attitude ratings for five company types]

The parallel coordinate display in Fig. 5.14 , produced using SYSTAT, graphs the profile of average attitude ratings for all inspectors within two specific types of company: the left graph for inspectors from PC manufacturers and the right graph for automobile manufacturers.

There are fewer lines in each display than the number of inspectors from each type of company simply because several inspectors from each type of company were missing a rating on at least one of the nine attitude variables. The graphs show great variability in scores amongst inspectors within a company type, but there are some overall patterns evident.

For example, inspectors from automobile companies clearly and fairly uniformly rated mgmtcomm toward the low end of the scale, whereas the reverse was generally true for that variable for inspectors from PC manufacturers. Conversely, inspectors from automobile companies tend to rate acctrain and trainapp more toward the middle to high end of the scale, whereas the reverse is generally true for those variables for inspectors from PC manufacturers.

Icon Plots

Perhaps the most creative types of multivariate displays are the so-called icon plots. SYSTAT and STATGRAPHICS offer an impressive array of different types of icon plots, including, amongst others, Chernoff's faces, profile plots, histogram plots, star glyphs and sunray plots (Jacoby 1998 provides a detailed discussion of icon plots).

Icon plots generally use a specific visual construction to represent the variable scores obtained by each individual within a sample or group. All icon plots are thus methods for displaying the response patterns for individual members of a sample, as long as those individuals are not missing any scores on the variables to be displayed (note that this is the same limitation as for radar plots and parallel coordinate displays). To illustrate icon plots, without generating too many icons to focus on, Figs. 5.15, 5.16, 5.17 and 5.18 present four different icon plots for QCI inspectors classified, using a new variable called BEST_WORST, as either the worst performers (= 1, where their accuracy scores were less than 70%) or the best performers (= 2, where their accuracy scores were 90% or greater).

[Fig. 5.15: Chernoff's faces icon plot comparing individual attitude ratings for best and worst performing inspectors]

[Fig. 5.16: Profile plot comparing individual attitude ratings for best and worst performing inspectors]

[Fig. 5.17: Histogram plot comparing individual attitude ratings for best and worst performing inspectors]

[Fig. 5.18: Sunray plot comparing individual attitude ratings for best and worst performing inspectors]

The Chernoff’s faces plot gets its name from the visual icon used to represent variable scores – a cartoon-type face. This icon tries to capitalise on our natural human ability to recognise and differentiate faces. Each feature of the face is controlled by the scores on a single variable. In SYSTAT, up to 20 facial features are controllable; the first five being curvature of mouth, angle of brow, width of nose, length of nose and length of mouth (SYSTAT Software Inc., 2009 , p. 259). The theory behind Chernoff’s faces is that similar patterns of variable scores will produce similar looking faces, thereby making similarities and differences between individuals more apparent.

The profile plot and histogram plot are actually two variants of the same type of icon plot. A profile plot represents individuals’ scores for a set of variables using simplified line graphs, one per individual. The profile is scaled so that the vertical height of the peaks and valleys correspond to actual values for variables where the variables anchor the X-axis in a fashion similar to the parallel coordinate display. So, as you examine a profile from left to right across the X-axis of each graph, you are looking across the set of variables. A histogram plot represents the same information in the same way as for the profile plot but using histogram bars instead.

Figure 5.15 , produced using SYSTAT, shows a Chernoff’s faces plot for the best and worst performing inspectors using their ratings of job satisfaction, working conditions and the nine general attitude statements.

Each face is labelled with the inspector number it represents. The gaps indicate where an inspector had missing data on at least one of the variables, meaning a face could not be generated for them. The worst performers are drawn using red lines; the best using blue lines. The first variable is jobsat and this variable controls mouth curvature; the second variable is workcond and this controls angle of brow, and so on. It seems clear that there are differences in the faces between the best and worst performers with, for example, best performers tending to be more satisfied (smiling) and with higher ratings for working conditions (brow angle).

Beyond a broad visual impression, there is little in terms of precise inferences you can draw from a Chernoff’s faces plot. It really provides a visual sketch, nothing more. The fact that there is no obvious link between facial features, variables and score levels means that the Chernoff’s faces icon plot is difficult to interpret at the level of individual variables – a holistic impression of similarity and difference is what this type of plot facilitates.

Figure 5.16, produced using SYSTAT, shows a profile plot for the best and worst performing inspectors using their ratings of job satisfaction, working conditions and the nine attitude variables.

Like the Chernoff’s faces plot (Fig. 5.15), as you read across the rows of the plot from left to right, each plot corresponds to an inspector in the sample who was either in the worst performer (red) or best performer (blue) category. The first attitude variable is jobsat and anchors the left end of each line graph; the last variable is polsatis and anchors the right end of the line graph. The remaining variables are represented in order from left to right across the X-axis of each graph. Figure 5.16 shows that these inspectors are rather different in their attitude profiles, with best performers tending to show taller profiles on the first two variables, for example.

Figure 5.17, produced using SYSTAT, shows a histogram plot for the best and worst performing inspectors based on their ratings of job satisfaction, working conditions and the nine attitude variables. This plot tells the same story as the profile plot, only using histogram bars. Some people would prefer the histogram icon plot to the profile plot because each histogram bar corresponds to one variable, making the visual linking of a specific bar to a specific variable much easier than visually linking a specific position along the profile line to a specific variable.

The sunray plot is actually a simplified adaptation of the radar plot (called a “star glyph”) used to represent scores on a set of variables for each individual within a sample or group. Remember that a radar plot basically arranges the variables around a central point like a clock face; the first variable is represented at the 12 o’clock position and the remaining variables follow around the plot in a clockwise direction.

Unlike a radar plot, while the spokes (the actual 'star' of the glyph's name) of the plot are visible, no interpretive scale is evident. A variable's score is visually represented by its distance from the central point. Thus, the star glyphs in a sunray plot are designed, like Chernoff's faces, to provide a general visual impression, based on icon shape. A wide-diameter, well-rounded plot indicates an individual with high scores on all variables; a small-diameter, well-rounded plot indicates an individual with uniformly low scores. Jagged plots represent individuals with highly variable scores across the variables. 'Stars' of similar size, shape and orientation represent similar individuals.

Figure 5.18 , produced using STATGRAPHICS, shows a sunray plot for the best and worst performing inspectors. An interpretation glyph is also shown in the lower right corner of Fig. 5.18 , where variables are aligned with the spokes of a star (e.g. jobsat is at the 12 o’clock position). This sunray plot could lead you to form the visual impression that the worst performing inspectors (group 1) have rather less rounded rating profiles than do the best performing inspectors (group 2) and that the jobsat and workcond spokes are generally lower for the worst performing inspectors.

Comparatively speaking, the sunray plot makes identifying similar individuals a bit easier (perhaps even easier than Chernoff’s faces) and, when ordered as STATGRAPHICS showed in Fig. 5.18 , permits easier visual comparisons between groups of individuals, but at the expense of precise knowledge about variable scores. Remember, a holistic impression is the goal pursued using a sunray plot.

Multivariate graphical methods provide summary techniques for visually presenting certain characteristics of a complex array of data on variables. Such visual representations are generally better at helping us to form holistic impressions of multivariate data rather than any sort of tabular representation or numerical index. They also allow us to compress many numerical measures into a finite representation that is generally easy to understand. Multivariate graphical displays can add interest to an otherwise dry statistical reporting of numerical data. They are designed to appeal to our pattern recognition skills, focusing our attention on features of the data such as shape, level, variability and orientation. Some multivariate graphs (e.g. radar plots, sunray plots and multiplots) are useful not only for representing score patterns for individuals but also providing summaries of score patterns across groups of individuals.

Multivariate graphs tend to get very busy-looking and are hard to interpret if a great many variables or a large number of individuals need to be displayed (imagine any of the icon plots, for a sample of 200 questionnaire participants, displayed on an A4 page – each icon would be so small that its features could not be easily distinguished, thereby defeating the purpose of the display). In such cases, using numerical summary statistics (such as averages or correlations) in tabular form alone will provide a more economical and efficient summary. Also, some multivariate displays will work better for conveying certain types of information than others.

Information about variable relationships may be better displayed using a scatterplot matrix. Information about individual similarities and difference on a set of variables may be better conveyed using a histogram or sunray plot. Multiplots may be better suited to displaying information about group differences across a set of variables. Information about the overall similarity of individual entities in a sample might best be displayed using Chernoff’s faces.

Because people differ greatly in their visual capacities and preferences, certain types of multivariate displays will work for some people and not others. Sometimes, people will not see what you see in the plots. Some plots, such as Chernoff’s faces, may not strike a reader as a serious statistical procedure and this could adversely influence how convinced they will be by the story the plot conveys. None of the multivariate displays described here provide sufficiently precise information for solid inferences or interpretations; all are designed to simply facilitate the formation of holistic visual impressions. In fact, you may have noticed that some displays (scatterplot matrices and the icon plots, for example) provide no numerical scaling information that would help make precise interpretations. If precision in summary information is desired, the types of multivariate displays discussed here would not be the best strategic choices.

Virtually any research design which produces quantitative data/statistics for multiple variables provides opportunities for multivariate graphical data display which may help to clarify or illustrate important data characteristics or relationships. Thus, for survey research involving many identically-scaled attitudinal questions, a multivariate display may be just the device needed to communicate something about patterns in the data. Multivariate graphical displays are simply specialised communication tools designed to compress a lot of information into a meaningful and efficient format for interpretation—which tool to choose depends upon the message to be conveyed.

Generally speaking, visual representations of multivariate data could prove more useful in communicating to lay persons who are unfamiliar with statistics or who prefer visual as opposed to numerical information. However, these displays would probably require some interpretive discussion so that the reader clearly understands their intent.

Procedure 5.4: Assessing Central Tendency

The three most commonly reported measures of central tendency are the mean, median and mode. Each measure reflects a specific way of defining central tendency in a distribution of scores on a variable and each has its own advantages and disadvantages.

The mean is the most widely used measure of central tendency (also called the arithmetic average). Very simply, a mean is the sum of all the scores for a specific variable in a sample divided by the number of scores used in obtaining the sum. The resulting number reflects the average score for the sample of individuals on which the scores were obtained. If one were asked to predict the score that any single individual in the sample would obtain, the best prediction, in the absence of any other relevant information, would be the sample mean. Many parametric statistical methods (such as several of the procedures described in Chap. 7) deal with sample means in one way or another. For any sample of data, there is one and only one possible value for the mean in a specific distribution. For most purposes, the mean is the preferred measure of central tendency because it utilises all the available information in a sample.

In the context of the QCI database, Maree could quite reasonably ask what inspectors scored on the average in terms of mental ability ( mentabil ), inspection accuracy ( accuracy ), inspection speed ( speed ), overall job satisfaction ( jobsat ), and perceived quality of their working conditions ( workcond ). Table 5.3 shows the mean scores for the sample of 112 quality control inspectors on each of these variables. The statistics shown in Table 5.3 were computed using the SPSS Frequencies ... procedure. Notice that the table indicates how many of the 112 inspectors had a valid score for each variable and how many were missing a score (e.g. 109 inspectors provided a valid rating for jobsat; 3 inspectors did not).

[Table 5.3: Measures of central tendency for specific QCI variables]

Each mean needs to be interpreted in terms of the original units of measurement for each variable. Thus, the inspectors in the sample showed an average mental ability score of 109.84 (higher than the general population mean of 100 for the test), an average inspection accuracy of 82.14%, and an average speed for making quality control decisions of 4.48 s. Furthermore, in terms of their work context, inspectors reported an average overall job satisfaction of 4.96 (on the 7-point scale, nearly one full scale point above the Neutral point of 4, indicating a generally positive but not strong level of job satisfaction) and an average perceived quality of work conditions of 4.21 (on the 7-point scale, just about at the level of Stressful but Tolerable).

The mean is sensitive to the presence of extreme values, which can distort its value, giving a biased indication of central tendency. As we will see below, the median is an alternative statistic to use in such circumstances. However, it is also possible to compute what is called a trimmed mean where the mean is calculated after a certain percentage (say, 5% or 10%) of the lowest and highest scores in a distribution have been ignored (a process called ‘trimming’; see, for example, the discussion in Field 2018 , pp. 262–264). This yields a statistic less influenced by extreme scores. The drawbacks are that the decision as to what percentage to trim can be somewhat subjective and trimming necessarily sacrifices information (i.e. the extreme scores) in order to achieve a less biased measure. Some software packages, such as SPSS, SYSTAT or NCSS, can report a specific percentage trimmed mean, if that option is selected for descriptive statistics or exploratory data analysis (see Procedure 5.6 ) procedures. Comparing the original mean with a trimmed mean can provide an indication of the degree to which the original mean has been biased by extreme values.
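As a quick illustration of the idea (on invented scores, not the QCI data), scipy's trim_mean computes a trimmed mean directly:

```python
# Comparing the ordinary mean with a 10% trimmed mean; the scores are
# invented, with one extreme value included to show the effect.
import numpy as np
from scipy import stats

speed = np.array([2.1, 3.4, 3.9, 4.1, 4.4, 4.8, 5.2, 5.7, 6.0, 17.1])
print(speed.mean())                 # 5.67, pulled upward by the extreme 17.1
print(stats.trim_mean(speed, 0.1))  # 4.6875, with 10% trimmed from each tail
```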

Very simply, the median is the centre or middle score of a set of scores. By ‘centre’ or ‘middle’ is meant that 50% of the data values are smaller than or equal to the median and 50% of the data values are larger when the entire distribution of scores is rank ordered from the lowest to highest value. Thus, we can say that the median is that score in the sample which occurs at the 50th percentile. [Note that a ‘percentile’ is attached to a specific score that a specific percentage of the sample scored at or below. Thus, a score at the 25th percentile means that 25% of the sample achieved this score or a lower score.] Table 5.3 shows the 25th, 50th and 75th percentile scores for each variable – note how the 50th percentile score is exactly equal to the median in each case .
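This equality is easy to verify computationally; the following check, on arbitrary made-up numbers, shows that the 50th percentile and the median coincide:

```python
# The median equals the 50th percentile, checked on arbitrary numbers.
import numpy as np

x = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(np.median(x), np.percentile(x, 50))  # both print 3.5
```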

The median is reported somewhat less frequently than the mean but does have some advantages over the mean in certain circumstances. One such circumstance is when the sample of data has a few extreme values in one direction (either very large or very small relative to all other scores). In this case, the mean would be influenced (biased) to a much greater degree than would the median, since all of the data are used to calculate the mean (including the extreme scores) whereas only the single centre score is needed for the median. For this reason, many nonparametric statistical procedures (such as several of the procedures described in Chap. 7) focus on the median as the comparison statistic rather than on the mean.

A discrepancy between the values for the mean and median of a variable provides some insight to the degree to which the mean is being influenced by the presence of extreme data values. In a distribution where there are no extreme values on either side of the distribution (or where extreme values balance each other out on either side of the distribution, as happens in a normal distribution – see Fundamental Concept II ), the mean and the median will coincide at the same value and the mean will not be biased.

For highly skewed distributions, however, the value of the mean will be pulled toward the long tail of the distribution because that is where the extreme values lie. However, in such skewed distributions, the median will be insensitive (statisticians call this property ‘robustness’) to extreme values in the long tail. For this reason, the direction of the discrepancy between the mean and median can give a very rough indication of the direction of skew in a distribution (‘mean larger than median’ signals possible positive skewness; ‘mean smaller than median’ signals possible negative skewness). Like the mean, there is one and only one possible value for the median in a specific distribution.

In Fig. 5.19 , the left graph shows the distribution of speed scores and the right-hand graph shows the distribution of accuracy scores. The speed distribution clearly shows the mean being pulled toward the right tail of the distribution whereas the accuracy distribution shows the mean being just slightly pulled toward the left tail. The effect on the mean is stronger in the speed distribution indicating a greater biasing effect due to some very long inspection decision times.

An external file that holds a picture, illustration, etc.
Object name is 489638_3_En_5_Fig19_HTML.jpg

Effects of skewness in a distribution on the values for the mean and median

If we refer to Table 5.3 , we can see that the median score for each of the five variables has also been computed. Like the mean, the median must be interpreted in the original units of measurement for the variable. We can see that for mentabil , accuracy , and workcond , the value of the median is very close to the value of the mean, suggesting that these distributions are not strongly influenced by extreme data values in either the high or low direction. However, note that the median speed was 3.89 s compared to the mean of 4.48 s, suggesting that the distribution of speed scores is positively skewed (the mean is larger than the median—refer to Fig. 5.19 ). Conversely, the median jobsat score was 5.00 whereas the mean score was 4.96 suggesting very little substantive skewness in the distribution (mean and median are nearly equal).

The mode is the simplest measure of central tendency. It is defined as the most frequently occurring score in a distribution. Put another way, it is the score that more individuals in the sample obtain than any other score. An interesting problem associated with the mode is that there may be more than one in a specific distribution. In the case where multiple modes exist, the issue becomes which value do you report? The answer is that you must report all of them. In a ‘normal’ bell-shaped distribution, there is only one mode and it is indeed at the centre of the distribution, coinciding with both the mean and the median.

Table 5.3 also shows the mode for each of the five variables. For example, inspectors achieved a mentabil score of 111 more often than any other score, and reported a jobsat rating of 6 more often than any other rating. SPSS only ever reports one mode even if several are present, so one must be careful and look at a histogram plot for each variable to make a final determination of the mode(s) for that variable.
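By way of contrast, pandas reports all modes when several are present, which makes the multimodality check straightforward; the ratings below are invented for illustration:

```python
# pandas reports every mode of a multimodal distribution, avoiding the
# single-mode limitation of SPSS noted above; the ratings are invented.
import pandas as pd

ratings = pd.Series([6, 6, 3, 4, 5, 3, 6, 3])
print(ratings.mode().tolist())  # [3, 6] -- both modes are reported
```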

All three measures of central tendency yield information about what is going on in the centre of a distribution of scores. The mean and median provide a single number which can summarise the central tendency in the entire distribution. The mode can yield one or multiple indices. With many measurements on individuals in a sample, it is advantageous to have single number indices which can describe the distributions in summary fashion. In a normal or near-normal distribution of sample data, the mean, the median, and the mode will all generally coincide at the one point. In this instance, all three statistics will provide approximately the same indication of central tendency. Note however that it is seldom the case that all three statistics would yield exactly the same number for any particular distribution. The mean is the most useful statistic, unless the data distribution is skewed by extreme scores, in which case the median should be reported.

While measures of central tendency are useful descriptors of distributions, summarising data using a single numerical index necessarily reduces the amount of information available about the sample. Not only do we need to know what is going on in the centre of a distribution, we also need to know what is going on around the centre of the distribution. For this reason, most social and behavioural researchers report not only measures of central tendency, but also measures of variability (see Procedure 5.5 ). The mode is the least informative of the three statistics because of its potential for producing multiple values.

Measures of central tendency are useful in almost any type of experimental design, survey or interview study, and in any observational studies where quantitative data are available and must be summarised. The decision as to whether the mean or median should be reported depends upon the nature of the data, which should ideally be ascertained by visual inspection of the data distribution. Some researchers opt to report both measures routinely. Computation of means is a prelude to many parametric statistical methods (see, for example, several of the procedures described in Chap. 7); comparison of medians is associated with many nonparametric statistical methods (see, for example, other procedures described in Chap. 7).

Procedure 5.5: Assessing Variability

There are a variety of measures of variability to choose from including the range, interquartile range, variance and standard deviation. Each measure reflects a specific way of defining variability in a distribution of scores on a variable and each has its own advantages and disadvantages. Most measures of variability are associated with a specific measure of central tendency so that researchers are now commonly expected to report both a measure of central tendency and its associated measure of variability whenever they display numerical descriptive statistics on continuous or ranked-ordered variables.

Range

This is the simplest measure of variability for a sample of data scores. The range is merely the largest score in the sample minus the smallest score in the sample. The range is the one measure of variability not explicitly associated with any measure of central tendency. It gives a very rough indication as to the extent of spread in the scores. However, since the range uses only two of the total available scores in the sample, the rest of the scores are ignored, which means that a lot of potentially useful information is being sacrificed. There are also problems if either the highest or lowest (or both) scores are atypical or too extreme in their value (as in highly skewed distributions). When this happens, the range gives a very inflated picture of the typical variability in the scores. Thus, the range tends not to be a frequently reported measure of variability.

Table 5.4 shows a set of descriptive statistics, produced by the SPSS Frequencies procedure, for the mentabil, accuracy, speed, jobsat and workcond measures in the QCI database. In the table, you will find three rows labelled ‘Range’, ‘Minimum’ and ‘Maximum’.

[Table 5.4: Measures of central tendency and variability for specific QCI variables]

Using the data from these three rows, we can draw the following descriptive picture. Mentabil scores spanned a range of 50 (from a minimum score of 85 to a maximum score of 135). Speed scores had a range of 16.05 s (from 1.05 s, the fastest quality decision, to 17.10 s, the slowest quality decision). Accuracy scores had a range of 43 (from 57%, the least accurate inspector, to 100%, the most accurate inspector). Both work context measures (jobsat and workcond) exhibited a range of 6, the largest possible range given the 1 to 7 scale of measurement for these two variables.

Interquartile Range

The Interquartile Range ( IQR ) is a measure of variability that is specifically designed to be used in conjunction with the median. The IQR also takes care of the extreme data problem which typically plagues the range measure. The IQR is defined as the range that is covered by the middle 50% of scores in a distribution once the scores have been ranked in order from lowest value to highest value. It is found by locating the value in the distribution at or below which 25% of the sample scored and subtracting this number from the value in the distribution at or below which 75% of the sample scored. The IQR can also be thought of as the range one would compute after the bottom 25% of scores and the top 25% of scores in the distribution have been ‘chopped off’ (or ‘trimmed’ as statisticians call it).

The IQR gives a much more stable picture of the variability of scores and, like the median, is relatively insensitive to the biasing effects of extreme data values. Some behavioural researchers prefer to divide the IQR in half which gives a measure called the Semi-Interquartile Range ( S-IQR ) . The S-IQR can be interpreted as the distance one must travel away from the median, in either direction, to reach the value which separates the top (or bottom) 25% of scores in the distribution from the remaining 75%.

The IQR or S-IQR is typically not produced by descriptive statistics procedures by default in many computer software packages; however, it can usually be requested as an optional statistic to report or it can easily be computed by hand using percentile scores. Both the median and the IQR figure prominently in Exploratory Data Analysis, particularly in the production of boxplots (see Procedure 5.6 ).
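Since the IQR and S-IQR can indeed be computed by hand from percentiles, here is a minimal sketch on invented speed-like scores:

```python
# Computing the range, IQR and S-IQR from percentiles; the scores are
# invented stand-ins, not the actual QCI speed data.
import numpy as np

speed = np.array([1.05, 2.2, 3.1, 3.9, 4.5, 5.7, 8.0, 17.1])
q25, q75 = np.percentile(speed, [25, 75])

print(speed.max() - speed.min())  # range: uses only the two extreme scores
print(q75 - q25)                  # IQR: spread of the middle 50% of scores
print((q75 - q25) / 2)            # S-IQR: distance either side of the median
```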

Figure 5.20 illustrates the conceptual nature of the IQR and S-IQR compared to that of the range. Assume that 100% of data values are covered by the distribution curve in the figure. It is clear that these three measures would provide very different values for a measure of variability. Your choice would depend on your purpose. If you simply want to signal the overall span of scores between the minimum and maximum, the range is the measure of choice. But if you want to signal the variability around the median, the IQR or S-IQR would be the measure of choice.

[Fig. 5.20: How the range, IQR and S-IQR measures of variability conceptually differ]

Note: Some behavioural researchers refer to the IQR as the hinge-spread (or H-spread ) because of its use in the production of boxplots:

  • the 25th percentile data value is referred to as the ‘lower hinge’;
  • the 75th percentile data value is referred to as the ‘upper hinge’; and
  • their difference gives the H-spread.

Midspread is another term you may see used as a synonym for interquartile range.

Referring back to Table 5.4 , we can find statistics reported for the median and for the ‘quartiles’ (25th, 50th and 75th percentile scores) for each of the five variables of interest. The ‘quartile’ values are useful for finding the IQR or S-IQR because SPSS does not report these measures directly. The median clearly equals the 50th percentile data value in the table.

If we focus, for example, on the speed variable, we could find its IQR by subtracting the 25th percentile score of 2.19 s from the 75th percentile score of 5.71 s to give a value for the IQR of 3.52 s (the S-IQR would simply be 3.52 divided by 2 or 1.76 s). Thus, we could report that the median decision speed for inspectors was 3.89 s and that the middle 50% of inspectors showed scores spanning a range of 3.52 s. Alternatively, we could report that the median decision speed for inspectors was 3.89 s and that the middle 50% of inspectors showed scores which ranged 1.76 s either side of the median value.

Note: We could compare the ‘Minimum’ or ‘Maximum’ scores to the 25th percentile score and 75th percentile score respectively to get a feeling for whether the minimum or maximum might be considered extreme or uncharacteristic data values.

Variance

The variance uses information from every individual in the sample to assess the variability of scores relative to the sample mean. Variance assesses the average squared deviation of each score from the mean of the sample. Deviation refers to the difference between an observed score value and the mean of the sample; they are squared simply because adding them up in their naturally occurring unsquared form (where some differences are positive and others are negative) always gives a total of zero, which is useless for an index purporting to measure something.

If many scores are quite different from the mean, we would expect the variance to be large. If all the scores lie fairly close to the sample mean, we would expect a small variance. If all scores exactly equal the mean (i.e. all the scores in the sample have the same value), then we would expect the variance to be zero.

Figure 5.21 illustrates some possibilities regarding variance of a distribution of scores having a mean of 100. The very tall curve illustrates a distribution with small variance. The distribution of medium height illustrates a distribution with medium variance and the flattest distribution illustrates a distribution with large variance.

[Fig. 5.21: The concept of variance]

If we had a distribution with no variance, the curve would simply be a vertical line at a score of 100 (meaning that all scores were equal to the mean). You can see that as variance increases, the tails of the distribution extend further outward and the concentration of scores around the mean decreases. You may have noticed that variance and range (as well as the IQR) will be related, since the range focuses on the difference between the ends of the two tails in the distribution and larger variances extend the tails. So, a larger variance will generally be associated with a larger range and IQR compared to a smaller variance.

It is generally difficult to descriptively interpret the variance measure in a meaningful fashion since it involves squared deviations around the sample mean. [Note: If you look back at Table 5.4, you will see the variance listed for each of the variables (e.g. the variance of accuracy scores is 84.118), but the numbers themselves make little sense and do not relate to the original measurement scale for the variables (which, for the accuracy variable, went from 0% to 100% accuracy).] Instead, we use the variance as a steppingstone for obtaining a measure of variability that we can clearly interpret, namely the standard deviation. However, you should know that variance is an important concept in its own right simply because it provides the statistical foundation for many of the correlational procedures and statistical inference procedures described in Chaps. 6, 7 and 8.

When considering either correlations or tests of statistical hypotheses, we frequently speak of one variable explaining or sharing variance with another (see the relevant procedures in Chaps. 6 and 7). In doing so, we are invoking the concept of variance as set out here: what we are saying is that variability in the behaviour of scores on one particular variable may be associated with or predictive of variability in scores on another variable of interest (e.g. it could explain why those scores have a non-zero variance).

Standard Deviation

The standard deviation (often abbreviated as SD, sd or Std. Dev.) is the most commonly reported measure of variability because it has a meaningful interpretation and is used in conjunction with reports of sample means. Variance and standard deviation are closely related measures in that the standard deviation is found by taking the square root of the variance. The standard deviation, very simply, is a summary number that reflects the ‘average distance of each score from the mean of the sample’. In many parametric statistical methods, both the sample mean and sample standard deviation are employed in some form. Thus, the standard deviation is a very important measure, not only for data description, but also for hypothesis testing and the establishment of relationships as well.
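The following minimal sketch, on invented scores, shows the variance-to-standard-deviation relationship numerically (the ddof=1 argument requests the sample, n - 1, versions that packages such as SPSS report):

```python
# Sample variance and standard deviation on invented scores; ddof=1
# gives the n - 1 denominator used for sample statistics.
import numpy as np

scores = np.array([4.1, 2.3, 5.6, 3.9, 7.2, 4.8])
variance = scores.var(ddof=1)
sd = scores.std(ddof=1)
print(variance, sd)
print(np.isclose(sd, np.sqrt(variance)))  # the SD is the square root of the variance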

Referring again back to Table 5.4 , we’ll focus on the results for the speed variable for discussion purposes. Table 5.4 shows that the mean inspection speed for the QCI sample was 4.48 s. We can also see that the standard deviation (in the row labelled ‘Std Deviation’) for speed was 2.89 s.

This standard deviation has a straightforward interpretation: we would say that ‘on the average, an inspector’s quality inspection decision speed differed from the mean of the sample by about 2.89 s in either direction’. In a normal distribution of scores (see Fundamental Concept II ), we would expect to see about 68% of all inspectors having decision speeds between 1.59 s (the mean minus one amount of the standard deviation) and 7.37 s (the mean plus one amount of the standard deviation).
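That 68% figure is easy to check by simulation; the sketch below draws a large normal sample with the reported mean and standard deviation and counts the proportion falling within one standard deviation of the mean:

```python
# A quick simulation check of the '68% within one SD' property for a
# normal distribution with the mean (4.48) and SD (2.89) from the text.
import numpy as np

rng = np.random.default_rng(0)
speed = rng.normal(loc=4.48, scale=2.89, size=100_000)
within_one_sd = np.mean((speed > 4.48 - 2.89) & (speed < 4.48 + 2.89))
print(within_one_sd)  # approximately 0.68
```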

We noted earlier that the range of the speed scores was 16.05 s. However, the fact that the maximum speed score was 17.1 s compared to the 75th percentile score of just 5.71 s seems to suggest that this maximum speed might be rather atypically large compared to the bulk of speed scores. This means that the range is likely to be giving us a false impression of the overall variability of the inspectors’ decision speeds.

Furthermore, given that the mean speed score was higher than the median speed score, suggesting that speed scores were positively skewed (this was confirmed by the histogram for speed shown in Fig. 5.19 in Procedure 5.4 ), we might consider emphasising the median and its associated IQR or S-IQR rather than the mean and standard deviation. Of course, similar diagnostic and interpretive work could be done for each of the other four variables in Table 5.4 .

Measures of variability (particularly the standard deviation) provide a summary measure that gives an indication of how variable (spread out) a particular sample of scores is. When used in conjunction with a relevant measure of central tendency (particularly the mean), a reasonable yet economical description of a set of data emerges. When there are extreme data values or severe skewness is present in the data, the IQR (or S-IQR) becomes the preferred measure of variability to be reported in conjunction with the sample median (or 50th percentile value). These latter measures are much more resistant (‘robust’) to influence by data anomalies than are the mean and standard deviation.

As mentioned above, the range is a very cursory index of variability, thus, it is not as useful as variance or standard deviation. Variance has little meaningful interpretation as a descriptive index; hence, standard deviation is most often reported. However, the standard deviation (or IQR) has little meaning if the sample mean (or median) is not reported along with it.

Knowing that the standard deviation for accuracy is 9.17 tells you little unless you know the mean accuracy (82.14) that it is the standard deviation from.

Like the sample mean, the standard deviation can be strongly biased by the presence of extreme data values or severe skewness in a distribution in which case the median and IQR (or S-IQR) become the preferred measures. The biasing effect will be most noticeable in samples which are small in size (say, less than 30 individuals) and far less noticeable in large samples (say, in excess of 200 or 300 individuals). [Note that, in a manner similar to a trimmed mean, it is possible to compute a trimmed standard deviation to reduce the biasing effect of extreme data values, see Field 2018 , p. 263.]

It is important to realise that the resistance of the median and IQR (or S-IQR) to extreme values is only gained by deliberately sacrificing a good deal of the information available in the sample (nothing is obtained without a cost in statistics). What is sacrificed is information from all other members of the sample other than those members who scored at the median and 25th and 75th percentile points on a variable of interest; information from all members of the sample would automatically be incorporated in mean and standard deviation for that variable.

Any investigation where you might report on or read about measures of central tendency on certain variables should also report measures of variability. This is particularly true for data from experiments, quasi-experiments, observational studies and questionnaires. It is important to consider measures of central tendency and measures of variability to be inextricably linked—one should never report one without the other if an adequate descriptive summary of a variable is to be communicated.

Other descriptive measures, such as those for skewness and kurtosis, may also be of interest if a more complete description of any variable is desired. Most good statistical packages can be instructed to report these additional descriptive measures as well.

Of all the statistics you are likely to encounter in the business, behavioural and social science research literature, means and standard deviations will dominate as measures for describing data. Additionally, these statistics will usually be reported when any parametric tests of statistical hypotheses are presented as the mean and standard deviation provide an appropriate basis for summarising and evaluating group differences.

Fundamental Concept I: Basic Concepts in Probability

The concept of simple probability.

In Procedures 5.1 and 5.2 , you encountered the idea of the frequency of occurrence of specific events such as particular scores within a sample distribution. Furthermore, it is a simple operation to convert the frequency of occurrence of a specific event into a number representing the relative frequency of that event. The relative frequency of an observed event is merely the number of times the event is observed divided by the total number of times one makes an observation. The resulting number ranges between 0 and 1 but we typically re-express this number as a percentage by multiplying it by 100%.

In the QCI database, Maree Lakota observed data from 112 quality control inspectors of which 58 were male and 51 were female (gender indications were missing for three inspectors). The statistics 58 and 51 are thus the frequencies of occurrence for two specific types of research participant, a male inspector or a female inspector.

If she divided each frequency by the total number of observations (i.e. 112), she would obtain .52 for males and .46 for females (leaving .02 of observations with unknown gender). These statistics are relative frequencies which indicate the proportion of times that Maree obtained data from a male or female inspector. Multiplying each relative frequency by 100% would yield 52% and 46%, which she could interpret as indicating that 52% of her sample was male and 46% was female (leaving 2% of the sample with unknown gender).
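These relative frequencies are straightforward to reproduce computationally; the sketch below rebuilds the gender counts reported above and lets pandas normalise them (the small discrepancy in the missing-data proportion is just rounding):

```python
# Relative frequencies via pandas; the counts (58 male, 51 female,
# 3 missing) come from the text. dropna=False keeps the missing cases
# in the denominator.
import numpy as np
import pandas as pd

gender = pd.Series(["male"] * 58 + ["female"] * 51 + [np.nan] * 3)
print(gender.value_counts(normalize=True, dropna=False).round(2))
# male ~ .52, female ~ .46, NaN ~ .03 (the text rounds the remainder to .02)
```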

It does not take much of a leap in logic to move from the concept of ‘relative frequency’ to the concept of ‘probability’. In our discussion above, we focused on relative frequency as indicating the proportion or percentage of times a specific category of participant was obtained in a sample. The emphasis here is on data from a sample.

Imagine now that Maree had infinite resources and research time and was able to obtain ever larger samples of quality control inspectors for her study. She could still compute the relative frequencies for obtaining data from males and females in her sample but as her sample size grew larger and larger, she would notice these relative frequencies converging toward some fixed values.

If, by some miracle, Maree could observe all of the quality control inspectors on the planet today, she would have measured the entire population and her computations of relative frequency for males and females would yield two precise numbers, each indicating the proportion of the population of inspectors that was male and the proportion that was female.

If Maree were then to list all of these inspectors and randomly choose one from the list, the chances that she would choose a male inspector would be equal to the proportion of the population of inspectors that was male and this logic extends to choosing a female inspector. The number used to quantify this notion of ‘chances’ is called a probability. Maree would therefore have established the probability of randomly observing a male or a female inspector in the population on any specific occasion.

Probability is expressed on a 0.0 (the observation or event will certainly not be seen) to 1.0 (the observation or event will certainly be seen) scale where values close to 0.0 indicate observations that are less certain to be seen and values close to 1.0 indicate observations that are more certain to be seen (a value of .5 indicates an even chance that an observation or event will or will not be seen – a state of maximum uncertainty). Statisticians often interpret a probability as the likelihood of observing an event or type of individual in the population.

In the QCI database, we noted that the relative frequency of observing males was .52 and for females was .46. If we take these relative frequencies as estimates of the proportions of each gender in the population of inspectors, then .52 and .46 represent the probability of observing a male or female inspector, respectively.

Statisticians would state this as “the probability of observing a male quality control inspector is .52” or, in a more commonly used shorthand code, the likelihood of observing a male quality control inspector is p = .52 (p for probability). For some, probabilities make more sense if they are converted to percentages (by multiplying by 100%). Thus, p = .52 can also be understood as a 52% chance of observing a male quality control inspector.

We have seen that relative frequency is a sample statistic that can be used to estimate the population probability. Our estimate will get more precise as we use larger and larger samples (technically, as the size of our samples more closely approximates the size of our population). In most behavioural research, we never have access to entire populations so we must always estimate our probabilities.

In some very special populations, having a known number of fixed possible outcomes, such as results of coin tosses or rolls of a die, we can analytically establish event probabilities without doing an infinite number of observations; all we must do is assume that we have a fair coin or die. Thus, with a fair coin, the probability of observing a H or a T on any single coin toss is ½ or .5 or 50%; the probability of observing a 6 on any single throw of a die is 1/6 or .16667 or 16.667%. With behavioural data, though, we can never measure all possible behavioural outcomes, which thereby forces researchers to depend on samples of observations in order to make estimates of population values.

The concept of probability is central to much of what is done in the statistical analysis of behavioural data. Whenever a behavioural scientist wishes to establish whether a particular relationship exists between variables or whether two groups, treated differently, actually show different behaviours, he/she is playing a probability game. Given a sample of observations, the behavioural scientist must decide whether what he/she has observed is providing sufficient information to conclude something about the population from which the sample was drawn.

This decision always has a non-zero probability of being in error simply because in samples that are much smaller than the population, there is always the chance or probability that we are observing something rare and atypical instead of something which is indicative of a consistent population trend. Thus, the concept of probability forms the cornerstone for statistical inference, about which we will have more to say later (see the Fundamental Concept in Chap. 7). Probability also plays an important role in helping us to understand theoretical statistical distributions (e.g. the normal distribution) and what they can tell us about our observations. We will explore this idea further in Fundamental Concept II.

The Concept of Conditional Probability

It is important to understand that the concept of probability as described above focuses upon the likelihood or chances of observing a specific event or type of observation for a specific variable relative to a population or sample of observations. However, many important behavioural research issues may focus on the question of the probability of observing a specific event given that the researcher has knowledge that some other event has occurred or been observed (this latter event is usually measured by a second variable). Here, the focus is on the potential relationship or link between two variables or two events.

With respect to the QCI database, Maree could ask the quite reasonable question “what is the probability (estimated in the QCI sample by a relative frequency) of observing an inspector being female given that she knows that an inspector works for a Large Business Computer manufacturer?”

To address this question, all she needs to know is:

  • how many inspectors from Large Business Computer manufacturers are in the sample (22); and
  • how many of those inspectors were female (7) (inspectors who were missing a score for either company or gender have been ignored here).

If she divides 7 by 22, she would obtain the probability that an inspector is female given that they work for a Large Business Computer manufacturer – that is, p = .32.

This type of question points to the important concept of conditional probability (‘conditional’ because we are asking “what is the probability of observing one event conditional upon our knowledge of some other event”).

Continuing with the previous example, Maree would say that the conditional probability of observing a female inspector working for a Large Business Computer manufacturer is .32 or, equivalently, a 32% chance. Compare this conditional probability of p = .32 to the overall probability of observing a female inspector in the entire sample (p = .46, as shown above).

This means that there is evidence for a connection or relationship between gender and the type of company an inspector works for. That is, the chances are lower for observing a female inspector from a Large Business Computer manufacturer than they are for simply observing a female inspector at all.

Maree therefore has evidence suggesting that females may be relatively under-represented in Large Business Computer manufacturing companies compared to the overall population. Knowing something about the company an inspector works for therefore can help us make a better prediction about their likely gender.

Suppose, however, that Maree’s conditional probability had been exactly equal to p = .46. This would mean that there was exactly the same chance of observing a female inspector working for a Large Business Computer manufacturer as there was of observing a female inspector in the general population. Here, knowing something about the company an inspector works for doesn’t help Maree make any better prediction about their likely gender. This would mean that the two variables are statistically independent of each other.

A classic case of events that are statistically independent is two successive throws of a fair die: rolling a six on the first throw gives us no information for predicting how likely it will be that we would roll a six on the second throw. The conditional probability of observing a six on the second throw given that I have observed a six on the first throw is .16667 (= 1 divided by 6), which is the same as the simple probability of observing a six on any specific throw. This statistical independence also means that if we wanted to know the probability of throwing two sixes on two successive throws of a fair die, we would just multiply the probabilities for each independent event (i.e., throw) together; that is, .16667 × .16667 = .02778 (this is known as the multiplication rule of probability, see, for example, Smithson 2000, p. 114).

Finally, you should know that conditional probabilities are often asymmetric. This means that for many types of behavioural variables, reversing the conditional arrangement will change the story about the relationship. Bayesian statistics (see the Fundamental Concept in Chap. 7) relies heavily upon this asymmetric relationship between conditional probabilities.

Maree has already learned that the conditional probability that an inspector is female given that they worked for a Large Business Computer manufacturer is p = .32. She could easily turn the conditional relationship around and ask what is the conditional probability that an inspector works for a Large Business Computer manufacturer given that the inspector is female?

From the QCI database, she can find that 51 inspectors in her total sample were female and of those 51, 7 worked for a Large Business Computer manufacturer. If she divided 7 by 51, she would get p = .14 (did you notice that all that changed was the number she divided by?). Thus, there is only a 14% chance of observing an inspector working for a Large Business Computer manufacturer given that the inspector is female – a rather different probability from p = .32, which tells a different story.
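A minimal R sketch of this asymmetry, reconstructing hypothetical gender and company indicators from the counts reported above (7 of the 22 Large Business Computer inspectors female; 51 females overall; the 58-male total is assumed, as before), rather than from the actual QCI file:

    # 1 = female, 0 = male; 1 = works for a Large Business Computer manufacturer
    female  <- c(rep(1, 51), rep(0, 58))
    largebc <- c(rep(1, 7), rep(0, 44),    # 7 of the 51 females at Large BC
                 rep(1, 15), rep(0, 43))   # remaining 15 Large BC inspectors male
    tab <- table(female, largebc)

    7 / sum(largebc)   # p(female | Large BC) = 7/22 = .32
    7 / sum(female)    # p(Large BC | female) = 7/51 = .14

    # prop.table() yields whole families of conditional probabilities at once:
    prop.table(tab, margin = 2)  # condition on company (columns)
    prop.table(tab, margin = 1)  # condition on gender (rows)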

As you will see in the Procedures in Chaps. 6 and 7, conditional relationships between categorical variables are precisely what crosstabulation contingency tables are designed to reveal.

Procedure 5.6: Exploratory Data Analysis

There are a variety of visual display methods for EDA, including stem & leaf displays, boxplots and violin plots. Each method reflects a specific way of displaying features of a distribution of scores or measurements and, of course, each has its own advantages and disadvantages. In addition, EDA displays are surprisingly flexible and can combine features in various ways to enhance the story conveyed by the plot.

Stem & Leaf Displays

The stem & leaf display is a simple data summary technique which not only rank orders the data points in a sample but presents them visually so that the shape of the data distribution is reflected. Stem & leaf displays are formed from data scores by splitting each score into two parts: the first part of each score serving as the ‘stem’, the second part as the ‘leaf’ (e.g. for 2-digit data values, the ‘stem’ is the number in the tens position; the ‘leaf’ is the number in the ones position). Each stem is then listed vertically, in ascending order, followed horizontally by all the leaves in ascending order associated with it. The resulting display thus shows all of the scores in the sample, but reorganised so that a rough idea of the shape of the distribution emerges. As well, extreme scores can be easily identified in a stem & leaf display.

Consider the accuracy and speed scores for the 112 quality control inspectors in the QCI sample. Figure 5.22 (produced by the R Commander Stem-and-leaf display … procedure) shows the stem & leaf displays for inspection accuracy (left display) and speed (right display) data.

Fig. 5.22 Stem & leaf displays produced by R Commander

[The first six lines reflect information from R Commander about each display: lines 1 and 2 show the actual R command used to produce the plot (the variable name has been highlighted in bold); line 3 gives a warning indicating that inspectors with missing values (= NA in R) on the variable have been omitted from the display; line 4 shows how the stems and leaves have been defined; line 5 indicates what a leaf unit represents in value; and line 6 indicates the total number (n) of inspectors included in the display.] In Fig. 5.22, for the accuracy display on the left-hand side, the ‘stems’ have been split into ‘half-stems’—one (which is starred) associated with the ‘leaves’ 0 through 4 and the other associated with the ‘leaves’ 5 through 9—a strategy that gives the display better balance and visual appeal.

Notice how the left stem & leaf display conveys a fairly clear (yet sideways) picture of the shape of the distribution of accuracy scores. It has a rather symmetrical bell-shape to it with only a slight suggestion of negative skewness (toward the extreme score at the top). The right stem & leaf display clearly depicts the highly positively skewed nature of the distribution of speed scores. Importantly, we could reconstruct the entire sample of scores for each variable using its display, which means that unlike most other graphical procedures, we didn’t have to sacrifice any information to produce the visual summary.
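Base R can produce a quick stem & leaf display with the stem() function. The sketch below uses a simulated, positively skewed stand-in for the QCI speed variable (the real data are not reproduced here); the scale argument controls whether stems are split into half-stems:

    set.seed(1)  # simulated stand-in for the positively skewed speed scores
    speed <- round(rlnorm(112, meanlog = 1.3, sdlog = 0.6), 2)
    stem(speed)             # each row is a stem; the digits are its leaves
    stem(speed, scale = 2)  # split stems for a finer-grained picture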

Some programs, such as SYSTAT, embellish their stem & leaf displays by indicating in which stem or half-stem the ‘median’ (50th percentile), the ‘upper hinge score’ (75th percentile), and ‘lower hinge score’ (25th percentile) occur in the distribution (recall the discussion of interquartile range in Procedure 5.5 ). This is shown in Fig. 5.23 , produced by SYSTAT, where M and H indicate the stem locations for the median and hinge points, respectively. This stem & leaf display labels a single extreme accuracy score as an ‘outside value’ and clearly shows that this actual score was 57.

Fig. 5.23 Stem & leaf display, produced by SYSTAT, of the accuracy QCI variable

Another important EDA technique is the boxplot or, as it is sometimes known, the box-and-whisker plot . This plot provides a symbolic representation that preserves less of the original nature of the data (compared to a stem & leaf display) but typically gives a better picture of the distributional characteristics. The basic boxplot, shown in Fig. 5.24 , utilises information about the median (50th percentile score) and the upper (75th percentile score) and lower (25th percentile score) hinge points in the construction of the ‘box’ portion of the graph (the ‘median’ defines the centre line in the box; the ‘upper’ and ‘lower hinge values’ define the end boundaries of the box—thus the box encompasses the middle 50% of data values).

Fig. 5.24 Boxplots for the accuracy and speed QCI variables

Additionally, the boxplot utilises the IQR (recall Procedure 5.5 ) as a way of defining what are called ‘fences’ which are used to indicate score boundaries beyond which we would consider a score in a distribution to be an ‘outlier’ (or an extreme or unusual value). In SPSS, the inner fence is typically defined as 1.5 times the IQR in each direction and a ‘far’ outlier or extreme case is typically defined as 3 times the IQR in either direction (Field 2018 , p. 193). The ‘whiskers’ in a boxplot extend out to the data values which are closest to the upper and lower inner fences (in most cases, the vast majority of data values will be contained within the fences). Outliers beyond these ‘whiskers’ are then individually listed. ‘Near’ outliers are those lying just beyond the inner fences and ‘far’ outliers lie well beyond the inner fences.

Figure 5.24 shows two simple boxplots (produced using SPSS), one for the accuracy QCI variable and one for the speed QCI variable. The accuracy plot shows a median value of about 83, roughly 50% of the data fall between about 77 and 89 and there is one outlier, inspector 83, in the lower ‘tail’ of the distribution. The accuracy boxplot illustrates data that are relatively symmetrically distributed without substantial skewness. Such data will tend to have their median in the middle of the box, whiskers of roughly equal length extending out from the box and few or no outliers.

The speed plot shows a median value of about 4 s, roughly 50% of the data fall between 2 s and 6 s and there are four outliers, inspectors 7, 62, 65 and 75 (although inspectors 65 and 75 fall at the same place and are rather difficult to read), all falling in the slow speed ‘tail’ of the distribution. Inspectors 65, 75 and 7 are shown as ‘near’ outliers (open circles) whereas inspector 62 is shown as a ‘far’ outlier (asterisk). The speed boxplot illustrates data which are asymmetrically distributed because of skewness in one direction. Such data may have their median offset from the middle of the box and/or whiskers of unequal length extending out from the box and outliers in the direction of the longer whisker. In the speed boxplot, the data are clearly positively skewed (the longer whisker and extreme values are in the slow speed ‘tail’).

Boxplots are very versatile representations in that side-by-side displays for sub-groups of data within a sample can permit easy visual comparisons of groups with respect to central tendency and variability. Boxplots can also be modified to incorporate information about error bands associated with the median producing what is called a ‘notched boxplot’. This helps in the visual detection of meaningful subgroup differences, where boxplot ‘notches’ don’t overlap.
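In R, such side-by-side comparisons take one line with the formula interface to boxplot(); setting notch = TRUE adds the median error bands just described. The speed scores and grouping factor below are simulated for illustration:

    set.seed(1)
    speed <- rlnorm(112, meanlog = 1.3, sdlog = 0.6)   # simulated speed scores
    company <- factor(sample(c("PC", "LargeBC", "LargeElec", "SmallElec", "Auto"),
                             112, replace = TRUE))     # hypothetical company types
    boxplot(speed ~ company, notch = TRUE,
            ylab = "Inspection decision speed (s)")
    # Medians whose notches do not overlap are plausibly different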

Figure 5.25 (produced using NCSS) compares the distributions of accuracy and speed scores for QCI inspectors from the five types of companies, plotted side-by-side.

Fig. 5.25 Comparisons of the accuracy (regular boxplots) and speed (notched boxplots) QCI variables for different types of companies

Focus first on the left graph in Fig. 5.25 which plots the distribution of accuracy scores broken down by company using regular boxplots. This plot clearly shows the differing degree of skewness in each type of company (indicated by one or more outliers in one ‘tail’, whiskers which are not the same length and/or the median line being offset from the centre of a box), the differing variability of scores within each type of company (indicated by the overall length of each plot—box and whiskers), and the differing central tendency in each type of company (the median lines do not all fall at the same level of accuracy score). From the left graph in Fig. 5.25 , we could conclude that: inspection accuracy scores are most variable in PC and Large Electrical Appliance manufacturing companies and least variable in the Large Business Computer manufacturing companies; Large Business Computer and PC manufacturing companies have the highest median level of inspection accuracy; and inspection accuracy scores tend to be negatively skewed (many inspectors toward higher levels, relatively fewer who are poorer in inspection performance) in the Automotive manufacturing companies. One inspector, working for an Automotive manufacturing company, shows extremely poor inspection accuracy performance.

The right display compares types of companies in terms of their inspection speed scores, using ‘notched’ boxplots. The notches define upper and lower error limits around each median. Aside from the very obvious positive skewness for speed scores (with a number of slow speed outliers) in every type of company (least so for Large Electrical Appliance manufacturing companies), the story conveyed by this comparison is that inspectors from Large Electrical Appliance and Automotive manufacturing companies have substantially faster median decision speeds compared to inspectors from Large Business Computer and PC manufacturing companies (i.e. their ‘notches’ do not overlap, in terms of speed scores, on the display).

Boxplots can also add interpretive value to other graphical display methods through the creation of hybrid displays. Such displays might combine a standard histogram with a boxplot along the X-axis to provide an enhanced picture of the data distribution as illustrated for the mentabil variable in Fig. 5.26 (produced using NCSS). This hybrid plot also employs a data ‘smoothing’ method called a density trace to outline an approximate overall shape for the data distribution. Any one graphical method would tell some of the story, but combined in the hybrid display, the story of a relatively symmetrical set of mentabil scores becomes quite visually compelling.

Fig. 5.26 A hybrid histogram-density-boxplot of the mentabil QCI variable

Violin Plots

Violin plots are a more recent and interesting EDA innovation, implemented in the NCSS software package (Hintze 2012 ). The violin plot gets its name from the rough shape that the plots tend to take on. Violin plots are another type of hybrid plot, this time combining density traces (mirror-imaged right and left so that the plots have a sense of symmetry and visual balance) with boxplot-type information (median, IQR and upper and lower inner ‘fences’, but not outliers). The goal of the violin plot is to provide a quick visual impression of the shape, central tendency and variability of a distribution (the length of the violin conveys a sense of the overall variability whereas the width of the violin conveys a sense of the frequency of scores occurring in a specific region).

Figure 5.27 (produced using NCSS) compares the distributions of speed scores for QCI inspectors across the five types of companies, plotted side-by-side. The violin plot conveys a similar story to the boxplot comparison for speed in the right graph of Fig. 5.25. However, notice that with the violin plot, unlike with a boxplot, you also get a sense of distributions that have ‘clumps’ of scores in specific areas. Some violin plots, like that for Automobile manufacturing companies in Fig. 5.27, have a shape suggesting a multi-modal distribution (recall Procedure 5.4 and the discussion of the fact that a distribution may have multiple modes). The violin plot in Fig. 5.27 has also been produced to show where the median (solid line) and mean (dashed line) would fall within each violin. This facilitates two interpretations: (1) a relative comparison of central tendency across the five companies and (2) the relative degree of skewness in the distribution for each company (indicated by the separation of the two lines within a violin; skewness is particularly pronounced for the Large Business Computer manufacturing companies).
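Although the text illustrates violin plots with NCSS, a comparable display can be sketched in R using the ggplot2 package (again with simulated stand-ins for the QCI variables):

    library(ggplot2)
    set.seed(1)
    qci <- data.frame(
      speed = rlnorm(112, meanlog = 1.3, sdlog = 0.6),   # simulated speed scores
      company = factor(sample(c("PC", "LargeBC", "LargeElec", "SmallElec", "Auto"),
                              112, replace = TRUE))      # hypothetical companies
    )
    ggplot(qci, aes(x = company, y = speed)) +
      geom_violin() +                                           # mirrored density traces
      stat_summary(fun = median, geom = "point", size = 3) +    # median as a dot
      stat_summary(fun = mean, geom = "point", shape = 4, size = 3)  # mean as a cross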

Fig. 5.27 Violin plot comparisons of the speed QCI variable for different types of companies

EDA methods (of which we have illustrated only a small subset; we have not reviewed dot density diagrams, for example) provide summary techniques for visually displaying certain characteristics of a set of data. The advantage of the EDA methods over more traditional graphing techniques such as those described in Procedure 5.2 is that as much of the original integrity of the data is maintained as possible while maximising the amount of summary information available about distributional characteristics.

Stem & leaf displays maintain the data in as close to their original form as possible whereas boxplots and violin plots provide more symbolic and flexible representations. EDA methods are best thought of as communication devices designed to facilitate quick visual impressions and they can add interest to any statistical story being conveyed about a sample of data. NCSS, SYSTAT, STATGRAPHICS and R Commander generally offer more options and flexibility in the generation of EDA displays than SPSS.

EDA methods tend to get cumbersome if a great many variables or groups need to be summarised. In such cases, using numerical summary statistics (such as means and standard deviations) will provide a more economical and efficient summary. Boxplots or violin plots are generally more space efficient summary techniques than stem & leaf displays.

Often, EDA techniques are used as data screening devices, which are typically not reported in actual write-ups of research (we will discuss data screening in more detail in a Procedure in Chap. 8). This is a perfectly legitimate use for the methods although there is an argument for researchers to put these techniques to greater use in published literature.

Software packages may use different rules for constructing EDA plots which means that you might get rather different looking plots and different information from different programs (you saw some evidence of this in Figs. 5.22 and 5.23 ). It is important to understand what the programs are using as decision rules for locating fences and outliers so that you are clear on how best to interpret the resulting plot—such information is generally contained in the user’s guides or manuals for NCSS (Hintze 2012 ), SYSTAT (SYSTAT Inc. 2009a , b ), STATGRAPHICS (StatPoint Technologies Inc. 2010 ) and SPSS (Norušis 2012 ).

Virtually any research design which produces numerical measures (even to the extent of just counting the number of occurrences of several events) provides opportunities for employing EDA displays which may help to clarify data characteristics or relationships. One extremely important use of EDA methods is as data screening devices for detecting outliers and other data anomalies, such as non-normality and skewness, before proceeding to parametric statistical analyses. In some cases, EDA methods can help the researcher to decide whether parametric or nonparametric statistical tests would be best to apply to his or her data because critical data characteristics such as distributional shape and spread are directly reflected.

Procedure 5.7: Standard ( z ) Scores

In certain practical situations in behavioural research, it may be desirable to know where a specific individual’s score lies relative to all other scores in a distribution. A convenient measure is to observe how many standard deviations (see Procedure 5.5 ) above or below the sample mean a specific score lies. This measure is called a standard score or z -score . Very simply, any raw score can be converted to a z -score by subtracting the sample mean from the raw score and dividing that result by the sample’s standard deviation. z -scores can be positive or negative and their sign simply indicates whether the score lies above (+) or below (−) the mean in value. A z -score has a very simple interpretation: it measures the number of standard deviations above or below the sample mean a specific raw score lies.

In the QCI database, we have a sample mean for speed scores of 4.48 s and a standard deviation for speed scores of 2.89 s (recall Table 5.4 in Procedure 5.5). If we are interested in the z-score for Inspector 65’s raw speed score of 11.94 s, we would obtain a z-score of +2.58 using the method described above (subtract 4.48 from 11.94 and divide the result by 2.89). The interpretation of this number is that a raw decision speed score of 11.94 s lies about 2.6 standard deviations above the mean decision speed for the sample.

z -scores have some interesting properties. First, if one converts (statisticians would say ‘transforms’) every available raw score in a sample to z -scores, the mean of these z -scores will always be zero and the standard deviation of these z -scores will always be 1.0. These two facts about z -scores (mean = 0; standard deviation = 1) will be true no matter what sample you are dealing with and no matter what the original units of measurement are (e.g. seconds, percentages, number of widgets assembled, amount of preference for a product, attitude rating, amount of money spent). This is because transforming raw scores to z -scores automatically changes the measurement units from whatever they originally were to a new system of measurements expressed in standard deviation units.
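A short R sketch of the transformation and its two invariant properties (mean of 0, standard deviation of 1), using a simulated stand-in for the speed scores:

    set.seed(1)
    speed <- rlnorm(112, meanlog = 1.3, sdlog = 0.6)   # simulated raw scores
    z <- (speed - mean(speed)) / sd(speed)             # z-score by hand
    all.equal(z, as.numeric(scale(speed)))             # scale() does the same job
    round(c(mean = mean(z), sd = sd(z)), 10)           # always 0 and 1

    # Inspector 65's speed z-score, from the reported sample statistics:
    (11.94 - 4.48) / 2.89                              # = 2.58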

Suppose Maree was interested in the performance statistics for the top 25% most accurate quality control inspectors in the sample. Given a sample size of 112, this would mean finding the top 28 inspectors in terms of their accuracy scores. Since Maree is interested in performance statistics, speed scores would also be of interest. Table 5.5 (generated using the SPSS Descriptives … procedure, listed using the Case Summaries … procedure and formatted for presentation using Excel) shows accuracy and speed scores for the top 28 inspectors in descending order of accuracy scores. The z -score transformation for each of these scores is also shown (last two columns) as are the type of company, education level and gender for each inspector.

Table 5.5 Listing of the 28 (top 25%) most accurate QCI inspectors’ accuracy and speed scores as well as standard (z) score transformations for each score

There are three inspectors (8, 9 and 14) who scored maximum accuracy of 100%. Such accuracy converts to a z -score of +1.95. Thus 100% accuracy is 1.95 standard deviations above the sample’s mean accuracy level. Interestingly, all three inspectors worked for PC manufacturers and all three had only high school-level education. The least accurate inspector in the top 25% had a z -score for accuracy that was .75 standard deviations above the sample mean.

Interestingly, the top three inspectors in terms of accuracy had decision speeds that fell below the sample’s mean speed; inspector 8 was the fastest inspector of the three with a speed just over 1 standard deviation ( z  = −1.03) below the sample mean. The slowest inspector in the top 25% was inspector 75 (case #28 in the list) with a speed z -score of +2.62; i.e., he was over two and a half standard deviations slower in making inspection decisions relative to the sample’s mean speed.

The fact that z-scores always have a common measurement scale having a mean of 0 and a standard deviation of 1.0 leads to an interesting application of standard scores. Suppose we focus on inspector number 65 (case #8 in the list) in Table 5.5. It might be of interest to compare this inspector’s quality control performance in terms of both his decision accuracy and decision speed. Such a comparison is impossible using raw scores since the inspector’s accuracy score and speed scores are different measures which have differing means and standard deviations expressed in fundamentally different units of measurement (percentages and seconds). However, if we are willing to assume that the score distributions for both variables are approximately the same shape and that both accuracy and speed are measured with about the same level of reliability or consistency (see a Procedure in Chap. 8), we can compare the inspector’s two scores by first converting them to z-scores within their own respective distributions as shown in Table 5.5.

Inspector 65 looks rather anomalous in that he demonstrated a relatively high level of accuracy (raw score = 94%; z  = +1.29) but took a very long time to make those accurate decisions (raw score = 11.94 s; z  = +2.58). Contrast this with inspector 106 (case #17 in the list) who demonstrated a similar level of accuracy (raw score = 92%; z  = +1.08) but took a much shorter time to make those accurate decisions (raw score = 1.70 s; z  = −.96). In terms of evaluating performance, from a company perspective, we might conclude that inspector 106 is performing at an overall higher level than inspector 65 because he can achieve a very high level of accuracy but much more quickly; accurate and fast is more cost effective and efficient than accurate and slow.

Note: We should be cautious here. We know from our previous explorations of the speed variable in Procedure 5.6 that accuracy scores look fairly symmetrical while speed scores are positively skewed, so the assumption that the two variables have the same distribution shape (which is required for z-score comparisons to be permissible) would be problematic.

You might have noticed that as you scanned down the two columns of z-scores in Table 5.5, there was a suggestion of a pattern between the signs attached to the respective z-scores for each person. There seems to be a very slight preponderance of pairs of z-scores where the signs are reversed (12 out of 22 pairs). This observation provides some very preliminary evidence to suggest that there may be a relationship between inspection accuracy and decision speed, namely that a more accurate decision tends to be associated with a faster decision speed. Of course, this pattern would be better verified using the entire sample rather than the top 25% of inspectors. However, you may find it interesting to learn that it is precisely this sort of suggestive evidence (about agreement or disagreement between z-score signs for pairs of variable scores throughout a sample) that is captured and summarised by a single statistical indicator called a ‘correlation coefficient’ (see the Fundamental Concept and Procedure in Chap. 6).

z-scores are not the only type of standard score that is commonly used. Three other types of standard scores are: stanines (standard nines), IQ scores and T-scores (not to be confused with the t-test described in a Procedure in Chap. 7). These other types of scores have the advantage of producing only positive integer scores rather than positive and negative decimal scores. This makes interpretation somewhat easier for certain applications. However, you should know that almost all other types of standard scores come from a specific transformation of z-scores. This is because once you have converted raw scores into z-scores, they can then be quite readily transformed into any other system of measurement by simply multiplying a person’s z-score by the new desired standard deviation for the measure and adding to that product the new desired mean for the measure.

T-scores are simply z-scores transformed to have a mean of 50.0 and a standard deviation of 10.0; IQ scores are simply z-scores transformed to have a mean of 100 and a standard deviation of 15 (or 16 in some systems). For more information, see Fundamental Concept II .
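The recipe in the preceding paragraph (multiply the z-score by the desired standard deviation, then add the desired mean) is one line of R:

    z <- 1.5          # an example z-score
    z * 10 + 50       # T-score (mean 50, sd 10): 65
    z * 15 + 100      # IQ-style score (mean 100, sd 15): 122.5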

Standard scores are useful for representing the position of each raw score within a sample distribution relative to the mean of that distribution. The unit of measurement becomes the number of standard deviations a specific score is away from the sample mean. As such, z-scores can permit cautious comparisons across samples or across different variables having vastly differing means and standard deviations within the constraints of the comparison samples having similarly shaped distributions and roughly equivalent levels of measurement reliability. z-scores also form the basis for establishing the degree of correlation between two variables. Transforming raw scores into z-scores does not change the shape of a distribution or the rank ordering of individuals within that distribution. For this reason, a z-score is referred to as a linear transformation of a raw score. Interestingly, z-scores provide an important foundational element for more complex analytical procedures such as factor analysis, cluster analysis and multiple regression analysis (see the relevant Procedures in Chaps. 6 and 7).

While standard scores are useful indices, they are subject to restrictions if used to compare scores across samples or across different variables. The samples must have similar distribution shapes for the comparisons to be meaningful and the measures must have similar levels of reliability in each sample. The groups used to generate the z -scores should also be similar in composition (with respect to age, gender distribution, and so on). Because z -scores are not an intuitively meaningful way of presenting scores to lay-persons, many other types of standard score schemes have been devised to improve interpretability. However, most of these schemes produce scores that run a greater risk of facilitating lay-person misinterpretations simply because their connection with z -scores is hidden or because the resulting numbers ‘look’ like a more familiar type of score which people do intuitively understand.

It is extremely rare for a T-score to exceed 100 or go below 0 because this would mean that the raw score was in excess of 5 standard deviations away from the sample mean. This unfortunately means that T-scores are often misinterpreted as percentages because they typically range between 0 and 100 and therefore ‘look’ like percentages. However, T-scores are definitely not percentages.

Finally, a common misunderstanding of z -scores is that transforming raw scores into z -scores makes them follow a normal distribution (see Fundamental Concept II ). This is not the case. The distribution of z -scores will have exactly the same shape as that for the raw scores; if the raw scores are positively skewed, then the corresponding z -scores will also be positively skewed.

z-scores are particularly useful in evaluative studies where relative performance indices are of interest. Whenever you compute a correlation coefficient (see the Procedure in Chap. 6), you are implicitly transforming the two variables involved into z-scores (which equates the variables in terms of mean and standard deviation), so that only the patterning in the relationship between the variables is represented. z-scores are also useful as a preliminary step to more advanced parametric statistical methods when variables differing in scale, range and/or measurement units must be equated for means and standard deviations prior to analysis.

Fundamental Concept II: The Normal Distribution

Arguably the most fundamental distribution used in the statistical analysis of quantitative data in the behavioural and social sciences is the normal distribution (also known as the Gaussian or bell-shaped distribution ). Many behavioural phenomena, if measured on a large enough sample of people, tend to produce ‘normally distributed’ variable scores. This includes most measures of ability, performance and productivity, personality characteristics and attitudes. The normal distribution is important because it is the one form of distribution that you must assume describes the scores of a variable in the population when parametric tests of statistical inference are undertaken. The standard normal distribution is defined as having a population mean of 0.0 and a population standard deviation of 1.0. The normal distribution is also important as a means of interpreting various types of scoring systems.

Figure 5.28 displays the standard normal distribution (mean = 0; standard deviation = 1.0) and shows that there is a clear link between z -scores and the normal distribution. Statisticians have analytically calculated the probability (also expressed as percentages or percentiles) that observations will fall above or below any specific z -score in the theoretical standard normal distribution. Thus, a z -score of +1.0 in the standard normal distribution will have 84.13% (equals a probability of .8413) of observations in the population falling at or below one standard deviation above the mean and 15.87% falling above that point. A z -score of −2.0 will have 2.28% of observations falling at that point or below and 97.72% of observations falling above that point. It is clear then that, in a standard normal distribution, z -scores have a direct relationship with percentiles .

Fig. 5.28 The normal (bell-shaped or Gaussian) distribution

Figure 5.28 also shows how T-scores relate to the standard normal distribution and to z -scores. The mean T-score falls at 50 and each increment or decrement of 10 T-score units means a movement of another standard deviation away from this mean of 50. Thus, a T-score of 80 corresponds to a z -score of +3.0—a score 3 standard deviations higher than the mean of 50.

Of special interest to behavioural researchers are the values for z -scores in a standard normal distribution that encompass 90% of observations ( z  = ±1.645—isolating 5% of the distribution in each tail), 95% of observations ( z  = ±1.96—isolating 2.5% of the distribution in each tail), and 99% of observations ( z  = ±2.58—isolating 0.5% of the distribution in each tail).

Depending upon the degree of certainty required by the researcher, these bands describe regions outside of which one might define an observation as being atypical or as perhaps not belonging to a distribution being centred at a mean of 0.0. Most often, what is taken as atypical or rare in the standard normal distribution is a score at least two standard deviations away from the mean, in either direction. Why choose two standard deviations? Since in the standard normal distribution, only about 5% of observations will fall outside a band defined by z -scores of ±1.96 (rounded to 2 for simplicity), this equates to data values that are 2 standard deviations away from their mean. This can give us a defensible way to identify outliers or extreme values in a distribution.
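All of the percentages and critical z-values quoted above can be reproduced with R’s standard normal functions:

    pnorm(1.0)                  # .8413: proportion at or below z = +1.0
    pnorm(-2.0)                 # .0228: proportion at or below z = -2.0
    1 - pnorm(-2.0)             # .9772: proportion above z = -2.0
    qnorm(c(.95, .975, .995))   # 1.645, 1.960, 2.576: the z-values bounding
                                # the central 90%, 95% and 99% regions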

Thinking ahead to what you will encounter in Chap. 7, this ‘banding’ logic can be extended into the world of statistics (like means and percentages) as opposed to just the world of observations. You will frequently hear researchers speak of some statistic estimating a specific value (a parameter) in a population, plus or minus some other value.

A survey organisation might report political polling results in terms of a percentage and an error band, e.g. 59% of Australians indicated that they would vote Labor at the next federal election, plus or minus 2%.

Most commonly, this error band (±2%) is defined by possible values for the population parameter that are about two standard deviations (or two standard errors—a concept discussed further in the Fundamental Concept in Chap. 7) away from the reported or estimated statistical value. In effect, the researcher is saying that on 95% of the occasions he/she would theoretically conduct his/her study, the population value estimated by the statistic being reported would fall between the limits imposed by the endpoints of the error band (the official name for this error band is a confidence interval; see the Procedure in Chap. 8). The well-understood mathematical properties of the standard normal distribution are what make such precise statements about levels of error in statistical estimates possible.

Checking for Normality

It is important to understand that transforming the raw scores for a variable to z-scores (recall Procedure 5.7) does not produce z-scores which follow a normal distribution; rather they will have the same distributional shape as the original scores. However, if you are willing to assume that the normal distribution is the correct reference distribution in the population, then you are justified in interpreting z-scores in light of the known characteristics of the normal distribution.

In order to justify this assumption, not only to enhance the interpretability of z -scores but more generally to enhance the integrity of parametric statistical analyses, it is helpful to actually look at the sample frequency distributions for variables (using a histogram (illustrated in Procedure 5.2 ) or a boxplot (illustrated in Procedure 5.6 ), for example), since non-normality can often be visually detected. It is important to note that in the social and behavioural sciences as well as in economics and finance, certain variables tend to be non-normal by their very nature. This includes variables that measure time taken to complete a task, achieve a goal or make decisions and variables that measure, for example, income, occurrence of rare or extreme events or organisational size. Such variables tend to be positively skewed in the population, a pattern that can often be confirmed by graphing the distribution.

If you cannot justify an assumption of ‘normality’, you may be able to force the data to be normally distributed by using what is called a ‘normalising transformation’. Such transformations will usually involve a nonlinear mathematical conversion (such as computing the logarithm, square root or reciprocal) of the raw scores. Such transformations will force the data to take on a more normal appearance so that the assumption of ‘normality’ can be reasonably justified, but at the cost of creating a new variable whose units of measurement and interpretation are more complicated. [For some non-normal variables, such as the occurrence of rare, extreme or catastrophic events (e.g. a 100-year flood or forest fire, coronavirus pandemic, the Global Financial Crisis or other type of financial crisis, man-made or natural disaster), the distributions cannot be ‘normalised’. In such cases, the researcher needs to model the distribution as it stands. For such events, extreme value theory (e.g. see Diebold et al. 2000 ) has proven very useful in recent years. This theory uses a variation of the Pareto or Weibull distribution as a reference, rather than the normal distribution, when making predictions.]

Figure 5.29 displays before and after pictures of the effects of a logarithmic transformation on the positively skewed speed variable from the QCI database. Each graph, produced using NCSS, is of the hybrid histogram-density trace-boxplot type first illustrated in Procedure 5.6. The left graph clearly shows the strong positive skew in the speed scores and the right graph shows the result of taking the log10 of each raw score.

Fig. 5.29 Combined histogram-density trace-boxplot graphs displaying the before and after effects of a ‘normalising’ log10 transformation of the speed variable

Notice how the long tail toward slow speed scores is pulled in toward the mean and the very short tail toward fast speed scores is extended away from the mean. The result is a more ‘normal’ appearing distribution. The assumption would then be that we could assume normality of speed scores, but only in a log10 format (i.e. it is the log of speed scores that we assume is normally distributed in the population). In general, taking the logarithm of raw scores provides a satisfactory remedy for positively skewed distributions (but not for negatively skewed ones). Furthermore, anything we do with the transformed speed scores now has to be interpreted in units of log10 (seconds), which is a more complex interpretation to make.
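A minimal R sketch of this before/after comparison, using plain histograms rather than the NCSS hybrid plots and a simulated, positively skewed stand-in for the speed scores:

    set.seed(1)
    speed <- rlnorm(112, meanlog = 1.3, sdlog = 0.6)   # simulated skewed scores
    log_speed <- log10(speed)                          # the normalising transform

    par(mfrow = c(1, 2))                               # two panels side by side
    hist(speed, main = "Raw speed", xlab = "Seconds")
    hist(log_speed, main = "Transformed", xlab = "log10(seconds)")
    par(mfrow = c(1, 1))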

Another visual method for detecting non-normality is to graph what is called a normal Q-Q plot (the Q-Q stands for Quantile-Quantile). This plots the percentiles for the observed data against the percentiles for the standard normal distribution (see Cleveland 1995 for more detailed discussion; also see Lane 2007, http://onlinestatbook.com/2/advanced_graphs/q-q_plots.html). If the pattern for the observed data follows a normal distribution, then all the points on the graph will fall approximately along a diagonal line.

Figure 5.30 shows the normal Q-Q plots for the original speed variable and the transformed log-speed variable, produced using the SPSS Explore... procedure. The diagnostic diagonal line is shown on each graph. In the left-hand plot, for speed , the plot points clearly deviate from the diagonal in a way that signals positive skewness. The right-hand plot, for log_speed, shows the plot points generally falling along the diagonal line thereby conforming much more closely to what is expected in a normal distribution.
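The same diagnostic is available in base R via qqnorm() and qqline(); the sketch below redefines the simulated speed and log_speed variables so it stands alone:

    set.seed(1)
    speed <- rlnorm(112, meanlog = 1.3, sdlog = 0.6)   # simulated skewed scores
    log_speed <- log10(speed)

    par(mfrow = c(1, 2))
    qqnorm(speed, main = "speed"); qqline(speed)             # bows away from the line
    qqnorm(log_speed, main = "log_speed"); qqline(log_speed) # hugs the line
    par(mfrow = c(1, 1))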

Fig. 5.30 Normal Q-Q plots for the original speed variable and the new log_speed variable

In addition to visual ways of detecting non-normality, there are also numerical ways. As highlighted in Chap. 1, there are two additional characteristics of any distribution, namely skewness (asymmetric distribution tails) and kurtosis (peakedness of the distribution). Both have an associated statistic that provides a measure of that characteristic, similar to the mean and standard deviation statistics. In a normal distribution, the values for the skewness and kurtosis statistics are both zero (skewness = 0 means a symmetric distribution; kurtosis = 0 means a mesokurtic distribution). The further away each statistic is from zero, the more the distribution deviates from a normal shape. Both the skewness statistic and the kurtosis statistic have standard errors associated with them (see the Fundamental Concept in Chap. 7; these work very much like the standard deviation, only for a statistic rather than for observations), which can be routinely computed by almost any statistical package when you request a descriptive analysis. Without going into the logic right now (this will come in a Fundamental Concept in Chap. 7), a rough rule of thumb you can use to check for normality using the skewness and kurtosis statistics is to do the following:

  • Prepare: Take the standard error for the statistic and multiply it by 2 (or 3 if you want to be more conservative).
  • Interval: Add the result from the Prepare step to the value of the statistic and subtract the result from the value of the statistic. You will end up with two numbers, one low and one high, that define the ends of an interval (what you have just created approximates what is called a ‘confidence interval’; see the Procedure in Chap. 8).
  • Check: If zero falls inside of this interval (i.e. between the low and high endpoints from the Interval step), then there is likely to be no significant issue with that characteristic of the distribution. If zero falls outside of the interval (i.e. lower than the low value endpoint or higher than the high value endpoint), then you likely have an issue with non-normality with respect to that characteristic. (A small R sketch of this three-step check follows the list.)
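A sketch of the check in R, under two stated assumptions: the e1071 package’s skewness() and kurtosis() functions are used with type = 2 (which is intended to match the SPSS definitions), and the standard errors use the common large-sample approximations sqrt(6/n) and sqrt(24/n) rather than SPSS’s exact small-sample formulas:

    library(e1071)  # provides sample skewness and kurtosis statistics
    set.seed(1)
    speed <- rlnorm(112, meanlog = 1.3, sdlog = 0.6)   # simulated skewed scores
    n <- length(speed)

    stats <- c(skew = skewness(speed, type = 2), kurt = kurtosis(speed, type = 2))
    ses   <- c(skew = sqrt(6 / n), kurt = sqrt(24 / n))  # approximate SEs

    cbind(lower = stats - 2 * ses,   # the Prepare and Interval steps
          upper = stats + 2 * ses)
    # Check: an interval that excludes zero flags a problem with that characteristic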

Visually, we saw in the left graph in Fig. 5.29 that the speed variable was highly positively skewed. What if Maree wanted to check some numbers to support this judgment? She could ask SPSS to produce the skewness and kurtosis statistics for both the original speed variable and the new log_speed variable using the Frequencies... or the Explore... procedure. Table 5.6 shows what SPSS would produce if the Frequencies ... procedure were used.

Table 5.6 Skewness and kurtosis statistics and their standard errors for both the original speed variable and the new log_speed variable

    Statistic                  speed    log_speed
    Skewness                   1.487     −.050
    Std. error of skewness      .229      .229
    Kurtosis                   3.071     −.672
    Std. error of kurtosis      .455      .455

Using the 3-step check rule described above, Maree could roughly evaluate the normality of the two variables as follows:

  • speed, skewness: [Prepare] 2 × .229 = .458 ➔ [Interval] 1.487 − .458 = 1.029 and 1.487 + .458 = 1.945 ➔ [Check] zero does not fall inside the interval bounded by 1.029 and 1.945, so there appears to be a significant problem with skewness. Since the value for the skewness statistic (1.487) is positive, this means the problem is positive skewness, confirming what the left graph in Fig. 5.29 showed.
  • speed, kurtosis: [Prepare] 2 × .455 = .91 ➔ [Interval] 3.071 − .91 = 2.161 and 3.071 + .91 = 3.981 ➔ [Check] zero does not fall inside the interval bounded by 2.161 and 3.981, so there appears to be a significant problem with kurtosis. Since the value for the kurtosis statistic (3.071) is positive, this means the problem is leptokurtosis—the peakedness of the distribution is too tall relative to what is expected in a normal distribution.
  • log_speed, skewness: [Prepare] 2 × .229 = .458 ➔ [Interval] −.050 − .458 = −.508 and −.050 + .458 = .408 ➔ [Check] zero falls within the interval bounded by −.508 and .408, so there appears to be no problem with skewness. The log transform appears to have corrected the problem, confirming what the right graph in Fig. 5.29 showed.
  • log_speed, kurtosis: [Prepare] 2 × .455 = .91 ➔ [Interval] −.672 − .91 = −1.582 and −.672 + .91 = .238 ➔ [Check] zero falls within the interval bounded by −1.582 and .238, so there appears to be no problem with kurtosis. The log transform appears to have corrected this problem as well, rendering the distribution more approximately mesokurtic (i.e. normal) in shape.

There are also more formal tests of significance (see the Fundamental Concept in Chap. 7) that one can use to numerically evaluate normality, such as the Kolmogorov–Smirnov test and the Shapiro–Wilk test. Each of these tests, for example, can be produced by SPSS on request, via the Explore... procedure.

1 For more information, see Chap. 1 – The language of statistics.

References for Procedure 5.2

  • Chang W. R graphics cookbook: Practical recipes for visualizing data. 2. Sebastopol, CA: O’Reilly Media; 2019. [ Google Scholar ]
  • Jacoby WG. Statistical graphics for univariate and bivariate data. Thousand Oaks, CA: Sage; 1997. [ Google Scholar ]
  • McCandless D. Knowledge is beautiful. London: William Collins; 2014. [ Google Scholar ]
  • Smithson MJ. Statistics with confidence. London: Sage; 2000. [ Google Scholar ]
  • Toseland M, Toseland S. Infographica: The world as you have never seen it before. London: Quercus Books; 2012. [ Google Scholar ]
  • Wilkinson L. Cognitive science and graphic design. In: SYSTAT Software Inc, editor. SYSTAT 13: Graphics. Chicago, IL: SYSTAT Software Inc; 2009. pp. 1–21. [ Google Scholar ]

Useful Additional Readings for Procedure 5.2

  • Field A. Discovering statistics using SPSS for windows. 5. Los Angeles: Sage; 2018. [ Google Scholar ]
  • George D, Mallery P. IBM SPSS statistics 25 step by step: A simple guide and reference. 15. Boston, MA: Pearson Education; 2019. [ Google Scholar ]
  • Hintze JL. NCSS 8 help system: Graphics. Kaysville, UT: Number Cruncher Statistical Systems; 2012. [ Google Scholar ]
  • StatPoint Technologies, Inc . STATGRAPHICS Centurion XVI user manual. Warrenton, VA: StatPoint Technologies Inc.; 2010. [ Google Scholar ]
  • SYSTAT Software Inc . SYSTAT 13: Graphics. Chicago, IL: SYSTAT Software Inc; 2009. [ Google Scholar ]

References for Procedure 5.3

  • Cleveland WR. Visualizing data. Summit, NJ: Hobart Press; 1995. [ Google Scholar ]
  • Jacoby WJ. Statistical graphics for visualizing multivariate data. Thousand Oaks, CA: Sage; 1998. [ Google Scholar ]

References for Procedure 5.6

  • Norušis MJ. IBM SPSS statistics 19 guide to data analysis. Upper Saddle River, NJ: Prentice Hall; 2012. [ Google Scholar ]
  • Field A. Discovering statistics using SPSS for Windows. 5. Los Angeles: Sage; 2018. [ Google Scholar ]
  • Hintze JL. NCSS 8 help system: Introduction. Kaysville, UT: Number Cruncher Statistical System; 2012. [ Google Scholar ]
  • SYSTAT Software Inc . SYSTAT 13: Statistics - I. Chicago, IL: SYSTAT Software Inc; 2009. [ Google Scholar ]

References for Fundamental Concept II

  • Diebold FX, Schuermann T, Stroughair D. Pitfalls and opportunities in the use of extreme value theory in risk management. The Journal of Risk Finance. 2000; 1 (2):30–35. doi: 10.1108/eb043443. [ CrossRef ] [ Google Scholar ]
  • Lane D. Online statistics education: A multimedia course of study. Houston, TX: Rice University; 2007. [ Google Scholar ]

Open Access

Published: 19 April 2024

A scoping review of continuous quality improvement in healthcare system: conceptualization, models and tools, barriers and facilitators, and impact

  • Aklilu Endalamaw 1 , 2 ,
  • Resham B Khatri 1 , 3 ,
  • Tesfaye Setegn Mengistu 1 , 2 ,
  • Daniel Erku 1 , 4 , 5 ,
  • Eskinder Wolka 6 ,
  • Anteneh Zewdie 6 &
  • Yibeltal Assefa 1  

BMC Health Services Research volume 24, Article number: 487 (2024)

The growing adoption of continuous quality improvement (CQI) initiatives in healthcare has generated a surge in research interest to gain a deeper understanding of CQI. However, comprehensive evidence regarding the diverse facets of CQI in healthcare has been limited. Our review sought to comprehensively grasp the conceptualization and principles of CQI, explore existing models and tools, analyze barriers and facilitators, and investigate its overall impacts.

This qualitative scoping review was conducted using Arksey and O’Malley’s methodological framework. We searched for articles in the PubMed, Web of Science, Scopus, and EMBASE databases. In addition, we accessed articles from Google Scholar. We used a mixed-methods analysis (qualitative content analysis for qualitative findings and descriptive statistics for quantitative findings) to summarize the results, and we followed the PRISMA extension for scoping reviews (PRISMA-ScR) framework to report the overall work.

A total of 87 articles, covering 14 CQI models, were included in the review. Nineteen tools were used across CQI models and initiatives, and the Plan-Do-Study/Check-Act cycle was the most commonly employed model for understanding the CQI implementation process. The main reported purposes of using CQI, reflecting its positive impacts, are to improve the structure of the health system (e.g., leadership, health workforce, health technology use, supplies, and costs), to enhance healthcare delivery processes and outputs (e.g., care coordination and linkages, satisfaction, accessibility, continuity of care, safety, and efficiency), and to improve treatment outcomes (reduced morbidity and mortality). The implementation of CQI is not without challenges. Cultural barriers (i.e., resistance or reluctance toward a quality-focused culture and fear of blame or punishment), technical barriers, structural barriers (related to organizational structure, processes, and systems), and strategic barriers (inadequate planning and inappropriate goals) were commonly reported during the implementation of CQI.

Conclusions

Implementing CQI initiatives necessitates a thorough understanding of key principles such as teamwork and timelines. To address challenges effectively, it is crucial to identify obstacles proactively and implement optimal interventions. Healthcare professionals and leaders need to be mentally equipped and cognizant of the significant role CQI initiatives play in achieving quality of care.


Continuous quality improvement (CQI) is a crucial initiative for enhancing quality in the health system and has gradually been adopted across the healthcare industry. In the early 20th century, Shewhart laid the foundation for quality improvement by describing three essential steps for process improvement: specification, production, and inspection [1, 2]. Deming then expanded Shewhart's three-step model into the 'plan, do, study/check, act' (PDSA or PDCA) cycle, which was applied to management practices in Japan in the 1950s [3] and was gradually translated into the health system. In 1991, Kuperman applied a CQI approach to healthcare, comprising selecting a process to be improved, assembling a team of expert clinicians who understand the process and the outcomes, determining key steps in the process and expected outcomes, collecting data that measure the key process steps and outcomes, and providing data feedback to practitioners [4]. These philosophies have served as the baseline for the principles of continuous improvement [5].

Continuous quality improvement fosters a culture of continuous learning, innovation, and improvement. It encourages proactive identification and resolution of problems, promotes employee engagement and empowerment, encourages trust and respect, and aims for better quality of care [ 6 , 7 ]. These characteristics drive the interaction of CQI with other quality improvement projects, such as quality assurance and total quality management [ 8 ]. Quality assurance primarily focuses on identifying deviations or errors through inspections, audits, and formal reviews, often settling for what is considered ‘good enough’, rather than pursuing the highest possible standards [ 9 , 10 ], while total quality management is implemented as the management philosophy and system to improve all aspects of an organization continuously [ 11 ].

Continuous quality improvement has been implemented to provide quality care. However, providing effective healthcare is a complicated and complex task when it comes to achieving the desired health outcomes and the overall well-being of individuals and populations. It necessitates tackling long-standing issues, including access, patient safety, medical advances, care coordination, patient-centered care, and quality monitoring [12, 13]. The history of quality improvement in healthcare is generally assumed to have started in 1854, when Florence Nightingale introduced quality improvement documentation [14]. In the decades that followed, Donabedian introduced structure, processes, and outcomes as components of quality of care in 1966 [15]. More comprehensively, the Institute of Medicine in the United States of America (USA) identified effectiveness, efficiency, equity, patient-centredness, safety, and timeliness as components of quality of care [16]. Moreover, quality of care has recently been considered an integral part of universal health coverage (UHC) [17], which requires initiatives to mobilise essential inputs [18].

While the overall objective of CQI in the health system is to enhance the quality of care, it is important to note that the purposes and principles of CQI can vary across contexts [19, 20]. This variation has sparked growing research interest. For instance, a review of CQI approaches for capacity building addressed its role in health workforce development [21]. Another systematic review, based on randomized controlled studies, assessed the effectiveness of CQI using training as an intervention and the PDSA model [22]. As a research gap, the former review was not directly related to the comprehensive elements of quality of care, while the latter focused solely on the impact of training using the PDSA model, among other potential models. Additionally, a review conducted in 2015 aimed to identify barriers and facilitators of CQI in Canadian contexts [23]. However, all these reviews presented different perspectives and investigated distinct outcomes, suggesting that there is still much to explore in comprehensively understanding the various aspects of CQI initiatives in healthcare.

As a result, we conducted a scoping review to address several aspects of CQI. Scoping reviews serve as a valuable tool for systematically mapping the existing literature on a specific topic. They are instrumental when dealing with heterogeneous or complex bodies of research. Scoping reviews provide a comprehensive overview by summarizing and disseminating findings across multiple studies, even when evidence varies significantly [ 24 ]. In our specific scoping review, we included various types of literature, including systematic reviews, to enhance our understanding of CQI.

This scoping review examined how CQI is conceptualized and measured and investigated models and tools for its application while identifying implementation challenges and facilitators. It also analyzed the purposes and impact of CQI on the health systems, providing valuable insights for enhancing healthcare quality.

Protocol registration and results reporting

Protocol registration for this scoping review was not conducted. Arksey and O'Malley's methodological framework was utilized to conduct this scoping review [25]. The procedure comprised defining the research questions, identifying relevant literature, selecting articles, extracting data, and summarizing the results. The review findings are reported using the PRISMA extension for scoping reviews (PRISMA-ScR) [26]; McGowan and colleagues also advise researchers to report findings from scoping reviews using PRISMA-ScR [27].

Defining the research problems

This review aims to comprehensively explore the conceptualization, models, tools, barriers, facilitators, and impacts of CQI within the healthcare system worldwide. Specifically, we address the following research questions: (1) How has CQI been defined across various contexts? (2) What are the diverse approaches to implementing CQI in healthcare settings? (3) Which tools are commonly employed for CQI implementation? (4) What barriers hinder and what facilitators support successful CQI initiatives? (5) What effects do CQI initiatives have on overall care quality?

Information source and search strategy

We conducted the search in the PubMed, Web of Science, Scopus, and EMBASE databases, and in the Google Scholar search engine. The search terms were selected around three distinct concepts: the first group comprised CQI-related terms; the second included terms related to the purposes for which CQI has been implemented; and the third covered processes and impacts. These terms were selected based on the Donabedian framework of structure, process, and outcome [28]. Additionally, detailed keywords were drawn from the primary health care framework, which describes dimensions under the process, output, outcome, and health system goals of any intervention for health [29]. The detailed search strategy is presented in Supplementary file 1 (Search strategy). The search for articles was initiated on August 12, 2023, and the last search was conducted on September 01, 2023.
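Purely to illustrate how three concept groups of this kind combine into a boolean query, the sketch below (Python, with hypothetical terms; the authors' actual strategy is in Supplementary file 1) ORs the terms within each group and ANDs the groups together:

    # Illustrative only: hypothetical term groups, not the actual search strategy
    # (which is given in Supplementary file 1).
    cqi_terms = ['"continuous quality improvement"', 'CQI']
    purpose_terms = ['"quality of care"', '"patient safety"', '"service delivery"']
    process_impact_terms = ['barrier*', 'facilitator*', 'implement*', 'outcome*']

    def or_block(terms):
        # Join one concept group with OR and wrap it in parentheses.
        return "(" + " OR ".join(terms) + ")"

    # The three concept groups are intersected with AND, as described above.
    query = " AND ".join(or_block(g) for g in (cqi_terms, purpose_terms, process_impact_terms))
    print(query)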

Eligibility criteria and article selection

Based on the scoping review's population, concept, and context framework [30], the population included any patients or clients; the concepts explored encompassed definitions, implementation, models, tools, barriers, facilitators, and impacts of CQI; and the contexts considered were any level of the health system. We included articles if they reported the results of a qualitative or quantitative empirical study, a case study, an analytic or descriptive synthesis, any review, or another written document; were published in peer-reviewed journals; and were designed to address at least one of the identified research questions or implementation outcomes (or their synonymous taxonomy, as described in the search strategy). We included articles published in English without geographic or time limitations. We excluded abstract-only records, conference abstracts, letters to editors, commentaries, and corrections.

We exported all citations to EndNote x20 to remove duplicates and screen relevant articles. The article selection process included automatic duplicate removal using EndNote x20, removal of records with unmatched titles and abstracts, removal of citation- and abstract-only materials, and full-text assessment. Article selection was conducted mainly by the first author (AE) and reported to the team during weekly meetings. Papers whose inclusion was unclear were discussed with the last author (YA) before a final decision was made. Whenever disagreements arose, they were resolved by discussion and by reconsidering the review questions in relation to the article's text. No further statistical analysis, such as calculating kappa, was performed to determine article inclusion or exclusion.

Data extraction and data items

We extracted the first author, publication year, country, settings, health problem, purpose of the study, study design, type of intervention (if applicable), CQI approaches/steps (if applicable), CQI tools and procedures (if applicable), and main findings, using a customized Microsoft Excel form.

Summarizing and reporting the results

The main findings were summarized and described based on the main themes: concepts under conceptualization, principles, teams, timelines, models, tools, barriers, facilitators, and impacts of CQI. Results-based convergent synthesis, achieved through mixed-methods analysis, involved content analysis to identify the thematic presentation of findings; a narrative description was used for quantitative findings, aligning them with the appropriate theme. The authors meticulously reviewed the primary findings from each included material and contextualized them in relation to the main themes. This approach provides a comprehensive understanding of complex interventions and health systems, acknowledging both quantitative and qualitative evidence.

Search results

A total of 11,251 documents were identified from various databases: SCOPUS ( n  = 4,339), PubMed ( n  = 2,893), Web of Science ( n  = 225), EMBASE ( n  = 3,651), and Google Scholar ( n  = 143). After removing duplicates ( n  = 5,061), 6,190 articles were evaluated by title and abstract. Subsequently, 208 articles were assessed for full-text eligibility. Following the eligibility criteria, 121 articles were excluded, leaving 87 included in the current review (Fig.  1 ).
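These counts are internally consistent; as a quick arithmetic check (Python, our verification rather than part of the review's methods):

    # Arithmetic check of the PRISMA flow counts reported above.
    per_database = {"SCOPUS": 4339, "PubMed": 2893, "Web of Science": 225,
                    "EMBASE": 3651, "Google Scholar": 143}
    identified = sum(per_database.values())
    assert identified == 11251            # records identified

    screened = identified - 5061          # after duplicate removal
    assert screened == 6190               # titles/abstracts screened

    included = 208 - 121                  # full texts assessed minus exclusions
    assert included == 87                 # articles included in the review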

Fig. 1 Article selection process

Operationalizing continuous quality improvement

Continuous Quality Improvement (CQI) is operationalized as a cyclic process that requires commitment to implementation, teamwork, time allocation, and celebrating successes and failures.

CQI is an ongoing cyclic process that follows reflexive, analytical, and iterative steps, including identifying gaps, generating data, developing and implementing action plans, evaluating performance, providing feedback to implementers and leaders, and proposing necessary adjustments [31, 32, 33, 34, 35, 36, 37, 38].

CQI requires committing to the philosophy of continuous improvement [19, 38], establishing a mission statement [37], and understanding the definition of quality [19].

CQI involves a wide range of patient-oriented measures and performance indicators, specifically satisfying internal and external customers, developing quality assurance, adopting common quality measures, and selecting process measures [ 8 , 19 , 35 , 36 , 37 , 39 , 40 ].

CQI requires celebrating successes and failures without personalization, leading each team member to develop an error-free attitude [19]. Success and failure are attributed to underlying organizational processes and systems rather than to individuals [8], because CQI is process-focused and based on collaborative, data-driven, responsive, rigorous, problem-solving statistical analysis [8, 19, 38]. Furthermore, a gap or failure opens another opportunity for building a data-driven learning organization [41].

CQI cannot be implemented without a CQI team [8, 19, 37, 39, 42, 43, 44, 45, 46]. A CQI team comprises individuals from various disciplines, typically a team leader, a subject-matter expert (a physician or other healthcare provider), a data analyst, a facilitator, frontline staff, and stakeholders [39, 43, 47, 48, 49]. It is also important to note that inviting stakeholders or partners as part of the CQI support intervention is crucial [19, 38, 48].

The timeline is another distinct feature of CQI, because the results of CQI vary with the duration of each implementation cycle [35]. There is no specific time limit for CQI implementation, although there is general consensus that a cycle should be relatively short [35]. For instance, CQI implementations took 2 months [42], 4 months [50], 9 months [51, 52], 12 months [53, 54, 55], and 17 months [49] to achieve the desired positive outcome, while bi-weekly [47] and monthly data reviews and analyses [44, 48, 56], and activities over 3 months [57], have also produced positive outcomes.

Continuous quality improvement models and tools

Several models have been utilized. The Plan-Do-Study/Check-Act cycle is a stepwise process involving project initiation, situation analysis, root cause identification, solution generation and selection, implementation, result evaluation, standardization, and future planning [7, 36, 37, 45, 47, 48, 49, 50, 51, 53, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70]. The FOCUS-PDCA cycle enhances the PDCA process by adding steps to find and improve a process (F), organize a knowledgeable team (O), clarify the process (C), understand variations (U), and select improvements (S) [55, 71, 72, 73]. The FADE cycle involves identifying a problem (Focus), understanding it through data analysis (Analyze), devising solutions (Develop), and implementing the plan (Execute) [74]. The Logic Framework involves brainstorming to identify improvement areas, conducting root cause analysis to develop a problem tree, reasoning logically to create an objective tree, formulating the framework, and executing improvement projects [75]. The Breakthrough Series approach requires CQI teams to meet in quarterly collaborative learning sessions, share learning experiences, and continue discussion by telephone and cross-site visits to strengthen learning and idea exchange [47]. Another CQI model is the Lean approach, which has been conducted with Kaizen principles [52], 5S principles, and the Six Sigma model. The 5S method (Sort, Set in order/Straighten, Shine, Standardize, Sustain) systematically organizes and improves the workplace [54, 76]. Kaizen principles guide CQI by advocating continuous improvement, valuing all ideas, solving problems, focusing on practical low-cost improvements, using data to drive change, acknowledging process defects, reducing variability and waste, recognizing every interaction as a customer-supplier relationship, empowering workers, responding to all ideas, and maintaining a disciplined workplace [77]. Lean Six Sigma applies the DMAIC methodology, which involves defining (D) and measuring (M) the problem, analyzing root causes (A), improving by finding solutions (I), and controlling by assessing process stability (C) [78, 79]. The 5C cyclic model (consultation, collection, consideration, collaboration, and celebration), the first CQI framework for volunteer dental services in Aboriginal communities, ensures quality care based on community needs [80]. One study used meetings involving activities such as reviewing objectives, assigning roles, discussing the agenda, completing tasks, retaining key outputs, planning future steps, and evaluating the meeting's effectiveness [81].
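To make the shared cyclic logic of these models concrete, the sketch below (Python) models the control flow of a generic PDSA-style cycle; the plan/do/study/act callables and the toy numbers are hypothetical, not drawn from any included study:

    # A minimal sketch of the PDSA cycle's control flow. The plan/do/study/act
    # functions are hypothetical placeholders for an improvement team's activities.
    def run_pdsa(plan, do, study, act, target, max_cycles=5):
        for cycle in range(1, max_cycles + 1):
            change = plan()              # Plan: analyze the gap, propose a change
            do(change)                   # Do: implement the change on a small scale
            result = study()             # Study/Check: measure the effect
            if act(result, target):      # Act: standardize if the target is met,
                return cycle, result     #      otherwise adjust and repeat
        return max_cycles, study()

    # Toy usage: raise a compliance rate from 60% toward a 75% target.
    state = {"rate": 0.60}
    print(run_pdsa(
        plan=lambda: 0.05,
        do=lambda change: state.update(rate=state["rate"] + change),
        study=lambda: state["rate"],
        act=lambda result, target: result >= target,
        target=0.75,
    ))  # -> (3, ~0.75)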

Various tools are involved in the implementation or evaluation of CQI initiatives: checklists [53, 82], flowcharts [81, 82, 83], cause-and-effect (fishbone or Ishikawa) diagrams [60, 62, 79, 81, 82], the fuzzy Pareto diagram [82], process maps [60], time series charts [48], why-why analysis [79], affinity diagrams and multivoting [81], run charts [47, 48, 51, 60, 84], and others listed in Table 1.
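As an example of the simplest of these tools, a run chart is a time-ordered plot of a quality measure with its median as a centerline; a minimal sketch (Python with matplotlib, entirely made-up weekly data):

    # Minimal run chart sketch with hypothetical data: a time-ordered plot of a
    # quality measure with the median as the centerline.
    import statistics
    import matplotlib.pyplot as plt

    waiting_times = [76, 70, 64, 55, 41, 49, 35, 30, 26, 22]  # made-up weekly values
    median = statistics.median(waiting_times)

    plt.plot(range(1, len(waiting_times) + 1), waiting_times, marker="o")
    plt.axhline(median, linestyle="--", label=f"median = {median} min")
    plt.xlabel("Week")
    plt.ylabel("Waiting time (min)")
    plt.title("Run chart (illustrative data)")
    plt.legend()
    plt.show()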

Barriers and facilitators of continuous quality improvement implementation

Implementing CQI initiatives is shaped by various barriers and facilitators, which can be grouped into four dimensions: cultural, technical, structural, and strategic.

Continuous quality improvement initiatives face various cultural, technical, structural, and strategic barriers. Cultural barriers involve resistance to change (e.g., not accepting online technology), lack of a quality-focused culture, staff apprehensiveness about reporting, and fear of blame or punishment [36, 41, 85, 86]. Technical barriers include various factors that hinder the effective implementation and execution of CQI processes [36, 86, 87, 88, 89]. Structural barriers arise from organizational structures, processes, and systems that can impede the effective implementation and sustainability of CQI [36, 85, 86, 87, 88]. Strategic barriers include, for example, the inability to select proper CQI goals and failure to integrate CQI into organizational planning and goals [36, 85, 86, 87, 88, 90].

Facilitators can likewise be grouped into cultural, structural, technical, and strategic dimensions that offer solutions to CQI barriers. Cultural challenges were addressed by developing a group culture oriented toward CQI and by other rewards [39, 41, 80, 85, 86, 87, 90, 91, 92]. Technical facilitators are pivotal to overcoming technical barriers [39, 42, 53, 69, 86, 90, 91]. Structural facilitators relate to improving communication, infrastructure, and systems [86, 92, 93]. Strategic facilitators include strengthening leadership and improving decision-making skills [43, 53, 67, 86, 87, 92, 94, 95] (Table 2).

Impact of continuous quality improvement

Continuous quality improvement initiatives can significantly impact the quality of healthcare across a wide range of health areas by improving structure, improving the health service delivery process, enhancing client wellbeing, and reducing mortality.

Structure components

The structure components are health leadership, financing, workforce, technology, and equipment and supplies. CQI has improved planning, monitoring, and evaluation [48, 53] and leadership and planning [48], indicating improvement from a leadership perspective. Implementing CQI in primary health care (PHC) settings has shown potential for maintaining or reducing operating costs [67]. Findings from another study indicate that the costs of implementing CQI interventions ranged from approximately $2,000 to $10,500 per facility per year, with an average cost of approximately $10 to $60 per admitted client [57]. However, based on model predictions, the average cost savings after implementing CQI were estimated at $5,430 [31]. CQI can also be applied to health workforce development [32]. CQI in institutional systems improved medical education [66, 96, 97], human resources management [53], staff motivation [76], and staff health awareness [69], although concerns were raised about CQI's impartiality, independence, and public accountability [96]. Regarding health technology, CQI also improved registration and documentation [48, 53, 98]. Furthermore, CQI initiatives increased cleanliness [54] and improved logistics, supplies, and equipment [48, 53, 68].

Process and output components

The process component focuses on the activities and actions involved in delivering healthcare services.

Service delivery

CQI interventions improved service delivery [53, 56, 99], including a significant 18% increase in overall quality of service performance [48]; improved patient counselling, adherence to appropriate procedures, and infection prevention [48, 68]; and optimized workflow [52].

Coordination and collaboration

CQI initiatives improved coordination and collaboration through collecting and analysing data, onsite technical support, training, supportive supervision [ 53 ] and facilitating linkages between work processes and a quality control group [ 65 ].

Patient satisfaction

CQI initiatives increased patient satisfaction and improved quality of life by optimizing care quality management, improving the quality of clinical nursing, reducing nursing defects, and enhancing client wellbeing [54, 76, 100], although CQI was not associated with changes in adolescents' and young adults' satisfaction [51].

Safety

CQI initiatives reduced medication error reports from 16 to 6 [101], significantly reduced the administration of inappropriate prophylactic antibiotics [44], decreased errors in inpatient care [52], decreased the overall episiotomy rate from 44.5% to 33.3% [83], reduced the overall incidence of unplanned endotracheal extubation [102], and improved the appropriate use of computed tomography angiography [103] and the appropriateness of diagnosis and treatment selection [47].

Continuity of care

CQI initiatives effectively improve continuity of care by improving client-physician interaction. For instance, provider continuity levels showed a 64% increase [55]. Modifying electronic medical record templates, scheduling, staff and parental education, standardizing work processes, and offering birth-to-1-year age-specific incentives in postnatal follow-up care increased continuity of care to 74% in 2018, compared with a baseline of 13% in 2012 [84].
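In absolute terms (our arithmetic, not the study's), that last result is a 61 percentage-point rise, or roughly a 5.7-fold relative increase:

    # Absolute vs. relative change implied by the continuity-of-care figures above.
    baseline, followup = 0.13, 0.74
    pp_increase = (followup - baseline) * 100   # percentage points
    fold_change = followup / baseline           # relative increase
    print(f"{pp_increase:.0f} percentage points; {fold_change:.1f}x relative increase")
    # -> 61 percentage points; 5.7x relative increase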

Efficiency

The CQI initiative yielded enhanced efficiency in the cardiac catheterization laboratory, as evidenced by improved punctuality of procedure starts and more efficient in-laboratory manual sheath pulls [78].

Accessibility

CQI initiatives were effective in improving accessibility, in terms of increased service coverage and utilization rates. Examples include screening for cigarette use, nutrition counselling, folate prescription, maternal care, and immunization coverage [53, 81, 104, 105]; reducing the percentage of patients not attending surgery from a baseline of 3.9% to 0.9% [43]; increasing Chlamydia screening rates from 29% to 60% [45]; increasing HIV care continuum coverage [51, 59, 60]; increasing uptake of postpartum long-acting reversible contraception from 6.9% at baseline to 25.4% [42]; increasing post-caesarean section prophylaxis from 36% to 89% [62]; a 31% increase in kangaroo care practice [50]; and increased follow-up [65]. Similarly, a QI intervention increased the quality of antenatal care by 29.3%, correct partograph use by 51.7%, and correct active third-stage labour management by 19.6% from baseline, but was not significantly associated with improvement in contraceptive service uptake [61].

Timely access

CQI interventions improved the timeliness of care provision [52] and reduced waiting times [62, 74, 76, 106]. For instance, the discharge process waiting time in the emergency department decreased from 76 min to 22 min [79]. CQI also reduced the mean post-procedural length of stay from 2.8 days to 2.0 days [31].
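These figures correspond to roughly a 71% shorter discharge wait and a 29% shorter stay (our arithmetic, not reported by the studies):

    # Relative reductions implied by the timely-access figures above.
    def pct_reduction(before, after):
        return (before - after) / before * 100

    print(f"{pct_reduction(76, 22):.0f}% shorter discharge wait")          # ~71%
    print(f"{pct_reduction(2.8, 2.0):.0f}% shorter post-procedural stay")  # ~29%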

Acceptability

Acceptability of CQI by healthcare providers was satisfactory. For instance, 88% of the faculty, 64% of the residents, and 82% of the staff believed CQI to be useful in the healthcare clinic [ 107 ].

Outcome components

Morbidity and mortality

CQI efforts have demonstrated better management outcomes among diabetic patients [40], patients with oral mucositis [71], and anaemic patients [72]. They have also reduced infection rates after caesarean section [62], reduced post-peritoneal-dialysis peritonitis [49, 108], and prevented pressure ulcers [70]: peritonitis incidence fell from once every 40.1 patient-months at baseline to once every 70.8 patient-months after CQI [49], and pressure ulcer prevalence fell by 63% within the 2 years from 2008 to 2010 [70]. Furthermore, CQI initiatives significantly reduced in-hospital deaths [31] and increased patient survival rates [108]. Figure 2 displays the overall process of the CQI implementations.
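Converted into rates (our arithmetic), the peritonitis intervals just cited correspond to roughly a 43% reduction in episodes per patient-month:

    # Converting the peritonitis intervals above into rates per patient-month.
    rate_before = 1 / 40.1   # episodes per patient-month at baseline
    rate_after = 1 / 70.8    # episodes per patient-month after CQI
    reduction = (rate_before - rate_after) / rate_before * 100
    print(f"{rate_before:.3f} -> {rate_after:.3f} per patient-month "
          f"({reduction:.0f}% lower)")
    # -> 0.025 -> 0.014 per patient-month (43% lower)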

Fig. 2 The overall mechanisms of continuous quality improvement implementation

Discussion

In this review, we examined the fundamental concepts and principles underlying CQI, the factors that either hinder or assist its successful application and implementation, and the purpose of CQI in enhancing quality of care across various health issues.

Our findings draw attention to the application and implementation of CQI, emphasizing its underlying concepts and principles, as evident in the existing literature [31, 32, 33, 34, 35, 36, 39, 40, 43, 45, 46]. Continuous quality improvement shares the principles of continuous improvement, such as a customer-driven focus, effective leadership, active participation of individuals, a process-oriented approach, systematic implementation, emphasis on design improvement and prevention, evidence-based decision-making, and fostering partnership [5]. Moreover, Deming's 14 principles laid the foundation for CQI principles [109]. These principles have been adapted and put into practice in various ways: ten [19] and five [38] principles in hospitals, five principles for capacity building [38], and two principles for medication error prevention [41]. As a principle, the application of CQI can be process-focused [8, 19] or impact-focused [38]. Impact-focused CQI concentrates on achieving specific outcomes or impacts, whereas process-focused CQI prioritizes and improves the underlying processes and systems. These principles complement each other and can be utilized according to the objectives of quality improvement initiatives in healthcare settings. Overall, CQI is an ongoing educational process that requires top management's involvement, demands coordination across departments, encourages the incorporation of views beyond the clinical area, and provides non-judgemental evidence based on objective data [110].

The current review recognized that CQI is not easy to implement. It requires reasonable utilization of various models and tools, whose application varies with the studied health problem and the purpose of the CQI initiative [111], as well as with context, content, structure, and usability [112]. It also requires overcoming cultural, technical, structural, and strategic barriers, which have emerged from the perspectives of clinical staff, managers, and health systems. Of the cultural obstacles, staff non-involvement, resistance to change, and reluctance to report errors were staff-related, whereas others, such as the absence of celebration of success and hierarchical and rational cultures, may require both staff and manager involvement. Staff members may be reluctant to report errors because of cultural factors, including lack of trust, hierarchical structures, fear of retribution, and a blame-oriented culture. These challenges pose obstacles to implementing standardized CQI practices, as observed, for instance, in community pharmacy settings [85]. A hierarchical culture, characterized by clearly defined levels of power, authority, and decision-making, posed challenges to implementing CQI initiatives in public health [41, 86]. Although a rational culture, a type of organizational culture, emphasizes logical thinking and rational decision-making, it can also create challenges for CQI implementation [41, 86]: hierarchical and rational cultures, which emphasize bureaucratic norms and narrow definitions of achievement, were found to act as barriers to the implementation of CQI [86]. These could be addressed by developing a shared mindset and collective commitment, establishing a shared purpose, developing group norms, and cultivating psychological preparedness among staff, managers, and clients to implement and sustain CQI initiatives. Furthermore, reversing cultural barriers necessitates cultural solutions: developing a group culture around CQI [41, 86], positive comprehensive perception [91], commitment [85], involving patients, families, leaders, and staff [39, 92], collaborating for a common goal [80, 86], effective teamwork [86, 87], and rewarding and celebrating successes [80, 90].

Technical barriers to CQI include inadequate capitalization of a project and insufficient support for CQI facilitators and data-entry managers [36], immature electronic medical records or poor information systems [36, 86], and lack of training and skills [86, 87, 88]. These challenges may cause the CQI team to rely on outdated information and technologies. Technical barriers can also undermine the solid foundation of CQI expertise among staff, the ability to recognize opportunities for improvement, a comprehensive understanding of how services are produced and delivered, and the routine use of expertise in daily work. Addressing these barriers requires knowledge-creation activities (training, seminars, and education) [39, 42, 53, 69, 86, 90, 91], availability of quality data [86], reliable information [92], and a manual-online hybrid reporting system [85].

Structural barriers to CQI include inadequate communication channels and a lack of standardized processes, specifically weak physician-to-physician synergies [36], a lack of mechanisms for disseminating knowledge, and limited use of communication mechanisms [86]. A lack of communication mechanisms endangers the sharing of ideas and feedback among CQI teams, leading to misunderstandings, limited participation, misinterpretations, and a lack of learning [113]. Knowledge translation facilitates the co-production of research, the subsequent diffusion of knowledge, and the development of stakeholders' capacity and skills [114]. Thus, the absence of a knowledge translation mechanism may cause missed opportunities for learning, inefficient problem-solving, and limited creativity. To overcome these challenges, organizations should establish effective communication and information systems [86, 93] and learning systems [92]. Although CQI and knowledge translation interact, it is essential to recognize that they are distinct: CQI focuses on process improvement within healthcare systems, aiming to optimize existing processes, reduce errors, and enhance efficiency.

In contrast, knowledge translation bridges the gap between research evidence and clinical practice, translating research findings into actionable knowledge for practitioners. While both CQI and knowledge translation aim to enhance healthcare quality and patient outcomes, they employ different strategies: CQI uses tools such as Plan-Do-Study-Act cycles and statistical process control, while knowledge translation involves knowledge synthesis and dissemination. Knowledge translation can also serve as a strategy to enhance CQI, and the two share the same underlying principle of continuous improvement. Therefore, effective strategies on the structural dimension may build efficient and effective steering councils, information systems, and structures to diffuse learning throughout the organization.

Strategic factors, such as goals, planning, funds, and resources, determine the overall purpose of CQI initiatives. Specific barriers were improper goals and poor planning [36, 86, 88], fragmentation of quality assurance policies [87], inadequate reinforcement of staff [36, 90], time constraints [85, 86], resource inadequacy [86], and work overload [86]. These barriers can be addressed by strengthening leadership [86, 87], CQI-based mentoring [94], periodic monitoring, supportive supervision, and coaching [43, 53, 87, 92, 95], participation, empowerment, and accountability [67], involving all stakeholders in decision-making [86, 87], provider-payer partnerships [64], and compensating staff for after-hours CQI meetings [85]. The strategic dimension, characterized by a strategic plan and integrated CQI efforts, is devoted to processes that are central to achieving strategic priorities. Roles and responsibilities are defined in terms of integrated strategic and quality-related goals [115].

The ultimate goal of CQI is to improve the quality of care, which is usually revealed through structure, process, and outcome. Once challenges are resolved and tools and models are used effectively, the goal of CQI reflects the ultimate reason and purpose of its implementation. First, effectively implemented CQI initiatives can improve leadership, health financing, health workforce development, health information technology, and availability of supplies as the building blocks of a health system [31, 48, 53, 68, 98]. Second, effectively implemented CQI initiatives improved the care delivery process (counselling, adherence to standards, coordination, collaboration, and linkages) [48, 53, 65, 68]. Third, CQI can improve the outputs of healthcare delivery, such as satisfaction, accessibility (timely access, utilization), continuity of care, safety, efficiency, and acceptability [52, 54, 55, 76, 78]. Finally, the effectiveness of CQI initiatives has been tested in key aspects of the HIV response, maternal and child health, non-communicable disease control, and other areas (e.g., surgery and peritonitis). However, CQI initiatives have not always been effective. For instance, CQI using a two- to nine-round audit cycle model with systems assessment tools did not significantly increase syphilis testing performance [116]. That study was conducted in Aboriginal and Torres Strait Islander primary health care settings; notably, 'the clinics may not have consistently prioritized syphilis testing performance in their improvement strategies, as facilitated by the CQI program' [116]. Additionally, CQI-based mentoring did not significantly improve the uptake of facility-based interventions, although it was effective in increasing community health worker visits during pregnancy and the postnatal period, knowledge about maternal and child health, exclusive breastfeeding practice, and HIV status disclosure [117]. That study, conducted in South Africa, revealed no significant association between the coverage of facility-based interventions and CQI implementation; this was attributed to already high antenatal and postnatal attendance rates in both control and intervention groups at baseline, leaving little room for improvement, while the coverage of HIV interventions remained consistently high throughout the study period [117].

Regarding healthcare and policy implications, CQI has played a vital role in advancing PHC and fostering the realization of UHC goals worldwide. The indicators in Donabedian's framework that are positively influenced by CQI efforts are comparable to those included in the PHC performance initiative's conceptual framework [29, 118, 119]. PHC is clearly described as the roadmap to realizing the vision of UHC [120, 121]. Given these circumstances, implementing CQI can contribute to achieving PHC principles and UHC objectives. For instance, by implementing CQI methods, countries have enhanced the accessibility, affordability, and quality of PHC services, leading to better health outcomes for their populations. CQI has facilitated identifying and resolving healthcare gaps and inefficiencies, enabling countries to optimize resource allocation and deliver more effective, patient-centered care. However, it is crucial to recognize that successful implementation of CQI necessitates optimizing the duration of each cycle, understanding challenges and barriers that extend beyond the health system and its settings, and acknowledging that its effectiveness may be compromised if these challenges are not adequately addressed.

Despite abundant literature, there are still gaps regarding the relationship between CQI and other dimensions within the healthcare system. No studies have examined the impact of CQI initiatives on catastrophic health expenditure, effective service coverage, patient-centredness, comprehensiveness, equity, health security, and responsiveness.

Limitations

This review has some limitations. First, only articles published in English were included, which may have excluded relevant non-English articles. Second, as this review follows a scoping methodology, the focus is on synthesizing the available evidence rather than critically appraising or scoring the quality of the included articles.

Conclusions

Continuous quality improvement is investigated as a continuous and ongoing intervention, whose implementation time can vary across cycles. The CQI team and implementation timelines were critical elements of CQI across the different models. Among the commonly used approaches, PDSA or PDCA is the most frequently employed. In most CQI models, a wide range of tools, nineteen in total, is utilized to support the improvement process. Cultural, technical, structural, and strategic barriers and facilitators are significant in implementing CQI initiatives. Implementing CQI initiatives aims to improve the health system's building blocks, enhance the health service delivery process and outputs, and ultimately prevent morbidity and reduce mortality. For future researchers, given that CQI is a context-dependent approach, conducting scale-up implementation research on catastrophic health expenditure, effective service coverage, patient-centredness, comprehensiveness, equity, health security, and responsiveness across various settings and health issues would be valuable.

Availability of data and materials

The data used and/or analyzed during the current study are available in this manuscript and/or the supplementary file.

References

Shewhart WA, Deming WE. In Memoriam: Walter A. Shewhart, 1891–1967. Am Stat. 1967;21(2):39–40.


Shewhart WA. Statistical method from the viewpoint of quality control. New York: Dover; 1986. ISBN 978-0486652320. OCLC 13822053. Reprint. Originally published: Washington, DC: Graduate School of the Department of Agriculture, 1939.

Moen R, editor. Foundation and history of the PDSA cycle. Asian Network for Quality Conference, Tokyo; 2009. Available from: https://www.deming.org/sites/default/files/pdf/2015/PDSA_History_Ron_MoenPdf

Kuperman G, James B, Jacobsen J, Gardner RM. Continuous quality improvement applied to medical care: experiences at LDS hospital. Med Decis Making. 1991;11(4suppl):S60–65.


Singh J, Singh H. Continuous improvement philosophy–literature review and directions. Benchmarking: An International Journal. 2015;22(1):75–119.

Goldstone J. Presidential address: Sony, Porsche, and vascular surgery in the 21st century. J Vasc Surg. 1997;25(2):201–10.

Radawski D. Continuous quality improvement: origins, concepts, problems, and applications. J Physician Assistant Educ. 1999;10(1):12–6.

Shortell SM, O’Brien JL, Carman JM, Foster RW, Hughes E, Boerstler H, et al. Assessing the impact of continuous quality improvement/total quality management: concept versus implementation. Health Serv Res. 1995;30(2):377.


Lohr K. Quality of health care: an introduction to critical definitions, concepts, principles, and practicalities. Striving for quality in health care. 1991.

Berwick DM. The clinical process and the quality process. Qual Manage Healthc. 1992;1(1):1–8.


Gift B. On the road to TQM. Food Manage. 1992;27(4):88–9.


Greiner A, Knebel E. The core competencies needed for health care professionals. Health professions education: a bridge to quality. 2003:45–73.

McCalman J, Bailie R, Bainbridge R, McPhail-Bell K, Percival N, Askew D, et al. Continuous quality improvement and comprehensive primary health care: a systems framework to improve service quality and health outcomes. Front Public Health. 2018;6(76):1–6.

Sheingold BH, Hahn JA. The history of healthcare quality: the first 100 years 1860–1960. Int J Afr Nurs Sci. 2014;1:18–22.


Donabedian A. Evaluating the quality of medical care. Milbank Q. 1966;44(3):166–206.

Institute of Medicine (US) Committee on Quality of Health Care in America. Crossing the Quality Chasm: A New Health System for the 21st Century. Washington (DC): National Academies Press (US). 2001. 2, Improving the 21st-century Health Care System. Available from: https://www.ncbi.nlm.nih.gov/books/NBK222265/ .

Rubinstein A, Barani M, Lopez AS. Quality first for effective universal health coverage in low-income and middle-income countries. Lancet Global Health. 2018;6(11):e1142–1143.


Agency for Healthcare Research and Quality. Quality improvement and monitoring at your fingertips. USA: Agency for Healthcare Research and Quality; 2022. Available from: https://qualityindicators.ahrq.gov/ .

Anderson CA, Cassidy B, Rivenburgh P. Implementing continuous quality improvement (CQI) in hospitals: lessons learned from the International Quality Study. Qual Assur Health Care. 1991;3(3):141–6.

Gardner K, Mazza D. Quality in general practice - definitions and frameworks. Aust Fam Physician. 2012;41(3):151–4.


Loper AC, Jensen TM, Farley AB, Morgan JD, Metz AJ. A systematic review of approaches for continuous quality improvement capacity-building. J Public Health Manage Pract. 2022;28(2):E354.

Hill JE, Stephani A-M, Sapple P, Clegg AJ. The effectiveness of continuous quality improvement for developing professional practice and improving health care outcomes: a systematic review. Implement Sci. 2020;15(1):1–14.

Candas B, Jobin G, Dubé C, Tousignant M, Abdeljelil AB, Grenier S, et al. Barriers and facilitators to implementing continuous quality improvement programs in colonoscopy services: a mixed methods systematic review. Endoscopy Int Open. 2016;4(02):E118–133.

Peters MD, Marnie C, Colquhoun H, Garritty CM, Hempel S, Horsley T, et al. Scoping reviews: reinforcing and advancing the methodology and application. Syst Reviews. 2021;10(1):1–6.

Arksey H, O’Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol. 2005;8(1):19–32.

Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med. 2018;169(7):467–73.

McGowan J, Straus S, Moher D, Langlois EV, O’Brien KK, Horsley T, et al. Reporting scoping reviews—PRISMA ScR extension. J Clin Epidemiol. 2020;123:177–9.

Donabedian A. Explorations in quality assessment and monitoring, Vol. 1: The definition of quality and approaches to its assessment. Ann Arbor: Health Administration Press; 1980.

World Health Organization. Operational framework for primary health care: transforming vision into action. Geneva: World Health Organization and the United Nations Children's Fund (UNICEF); 2020 [updated 14 December 2020; cited 2023 Oct 17]. Available from: https://www.who.int/publications/i/item/9789240017832 .

The Joanna Briggs Institute. The Joanna Briggs Institute Reviewers' Manual: 2014 edition. Australia: The Joanna Briggs Institute; 2014. p. 88–91.

Rihal CS, Kamath CC, Holmes DR Jr, Reller MK, Anderson SS, McMurtry EK, et al. Economic and clinical outcomes of a physician-led continuous quality improvement intervention in the delivery of percutaneous coronary intervention. Am J Manag Care. 2006;12(8):445–52.

Ade-Oshifogun JB, Dufelmeier T. Prevention and Management of Do not return notices: a quality improvement process for Supplemental staffing nursing agencies. Nurs Forum. 2012;47(2):106–12.

Rubenstein L, Khodyakov D, Hempel S, Danz M, Salem-Schatz S, Foy R, et al. How can we recognize continuous quality improvement? Int J Qual Health Care. 2014;26(1):6–15.

O’Neill SM, Hempel S, Lim YW, Danz MS, Foy R, Suttorp MJ, et al. Identifying continuous quality improvement publications: what makes an improvement intervention ‘CQI’? BMJ Qual Saf. 2011;20(12):1011–9.


Sibthorpe B, Gardner K, McAullay D. Furthering the quality agenda in Aboriginal community controlled health services: understanding the relationship between accreditation, continuous quality improvement and national key performance indicator reporting. Aust J Prim Health. 2016;22(4):270–5.

Bennett CL, Crane JM. Quality improvement efforts in oncology: are we ready to begin? Cancer Invest. 2001;19(1):86–95.

VanValkenburgh DA. Implementing continuous quality improvement at the facility level. Adv Ren Replace Ther. 2001;8(2):104–13.

Loper AC, Jensen TM, Farley AB, Morgan JD, Metz AJ. A systematic review of approaches for continuous quality improvement capacity-building. J Public Health Manage Practice. 2022;28(2):E354–361.

Ryan M. Achieving and sustaining quality in healthcare. Front Health Serv Manag. 2004;20(3):3–11.

Nicolucci A, Allotta G, Allegra G, Cordaro G, D’Agati F, Di Benedetto A, et al. Five-year impact of a continuous quality improvement effort implemented by a network of diabetes outpatient clinics. Diabetes Care. 2008;31(1):57–62.

Wakefield BJ, Blegen MA, Uden-Holman T, Vaughn T, Chrischilles E, Wakefield DS. Organizational culture, continuous quality improvement, and medication administration error reporting. Am J Med Qual. 2001;16(4):128–34.

Sori DA, Debelew GT, Degefa LS, Asefa Z. Continuous quality improvement strategy for increasing immediate postpartum long-acting reversible contraceptive use at Jimma University Medical Center, Jimma, Ethiopia. BMJ Open Qual. 2023;12(1):e002051.

Roche B, Robin C, Deleaval PJ, Marti MC. Continuous quality improvement in ambulatory surgery: the non-attending patient. Ambul Surg. 1998;6(2):97–100.

O’Connor JB, Sondhi SS, Mullen KD, McCullough AJ. A continuous quality improvement initiative reduces inappropriate prescribing of prophylactic antibiotics for endoscopic procedures. Am J Gastroenterol. 1999;94(8):2115–21.

Ursu A, Greenberg G, McKee M. Continuous quality improvement methodology: a case study on multidisciplinary collaboration to improve chlamydia screening. Fam Med Community Health. 2019;7(2):e000085.

Quick B, Nordstrom S, Johnson K. Using continuous quality improvement to implement evidence-based medicine. Lippincotts Case Manag. 2006;11(6):305–15 ( quiz 16 – 7 ).

Oyeledun B, Phillips A, Oronsaye F, Alo OD, Shaffer N, Osibo B, et al. The effect of a continuous quality improvement intervention on retention-in-care at 6 months postpartum in a PMTCT Program in Northern Nigeria: results of a cluster randomized controlled study. J Acquir Immune Defic Syndr. 2017;75(Suppl 2):S156–164.

Nyengerai T, Phohole M, Iqaba N, Kinge CW, Gori E, Moyo K, et al. Quality of service and continuous quality improvement in voluntary medical male circumcision programme across four provinces in South Africa: longitudinal and cross-sectional programme data. PLoS ONE. 2021;16(8):e0254850.


Wang J, Zhang H, Liu J, Zhang K, Yi B, Liu Y, et al. Implementation of a continuous quality improvement program reduces the occurrence of peritonitis in PD. Ren Fail. 2014;36(7):1029–32.

Stikes R, Barbier D. Applying the plan-do-study-act model to increase the use of kangaroo care. J Nurs Manag. 2013;21(1):70–8.

Wagner AD, Mugo C, Bluemer-Miroite S, Mutiti PM, Wamalwa DC, Bukusi D, et al. Continuous quality improvement intervention for adolescent and young adult HIV testing services in Kenya improves HIV knowledge. AIDS. 2017;31(Suppl 3):S243–252.

Le RD, Melanson SE, Santos KS, Paredes JD, Baum JM, Goonan EM, et al. Using lean principles to optimise inpatient phlebotomy services. J Clin Pathol. 2014;67(8):724–30.

Manyazewal T, Mekonnen A, Demelew T, Mengestu S, Abdu Y, Mammo D, et al. Improving immunization capacity in Ethiopia through continuous quality improvement interventions: a prospective quasi-experimental study. Infect Dis Poverty. 2018;7:7.

Kamiya Y, Ishijma H, Hagiwara A, Takahashi S, Ngonyani HAM, Samky E. Evaluating the impact of continuous quality improvement methods at hospitals in Tanzania: a cluster-randomized trial. Int J Qual Health Care. 2017;29(1):32–9.

Kibbe DC, Bentz E, McLaughlin CP. Continuous quality improvement for continuity of care. J Fam Pract. 1993;36(3):304–8.

Adrawa N, Ongiro S, Lotee K, Seret J, Adeke M, Izudi J. Use of a context-specific package to increase sputum smear monitoring among people with pulmonary tuberculosis in Uganda: a quality improvement study. BMJ Open Qual. 2023;12(3):1–6.

Hunt P, Hunter SB, Levan D. Continuous quality improvement in substance abuse treatment facilities: how much does it cost? J Subst Abuse Treat. 2017;77:133–40.

Azadeh A, Ameli M, Alisoltani N, Motevali Haghighi S. A unique fuzzy multi-control approach for continuous quality improvement in a radio therapy department. Qual Quantity. 2016;50(6):2469–93.

Memiah P, Tlale J, Shimabale M, Nzyoka S, Komba P, Sebeza J, et al. Continuous quality improvement (CQI) institutionalization to reach 95:95:95 HIV targets: a multicountry experience from the Global South. BMC Health Serv Res. 2021;21(1):711.

Yapa HM, De Neve JW, Chetty T, Herbst C, Post FA, Jiamsakul A, et al. The impact of continuous quality improvement on coverage of antenatal HIV care tests in rural South Africa: results of a stepped-wedge cluster-randomised controlled implementation trial. PLoS Med. 2020;17(10):e1003150.

Dadi TL, Abebo TA, Yeshitla A, Abera Y, Tadesse D, Tsegaye S, et al. Impact of quality improvement interventions on facility readiness, quality and uptake of maternal and child health services in developing regions of Ethiopia: a secondary analysis of programme data. BMJ Open Qual. 2023;12(4):e002140.

Weinberg M, Fuentes JM, Ruiz AI, Lozano FW, Angel E, Gaitan H, et al. Reducing infections among women undergoing cesarean section in Colombia by means of continuous quality improvement methods. Arch Intern Med. 2001;161(19):2357–65.

Andreoni V, Bilak Y, Bukumira M, Halfer D, Lynch-Stapleton P, Perez C. Project management: putting continuous quality improvement theory into practice. J Nurs Care Qual. 1995;9(3):29–37.

Balfour ME, Zinn TE, Cason K, Fox J, Morales M, Berdeja C, et al. Provider-payer partnerships as an engine for continuous quality improvement. Psychiatric Serv. 2018;69(6):623–5.

Agurto I, Sandoval J, De La Rosa M, Guardado ME. Improving cervical cancer prevention in a developing country. Int J Qual Health Care. 2006;18(2):81–6.

Anderson CI, Basson MD, Ali M, Davis AT, Osmer RL, McLeod MK, et al. Comprehensive multicenter graduate surgical education initiative incorporating entrustable professional activities, continuous quality improvement cycles, and a web-based platform to enhance teaching and learning. J Am Coll Surg. 2018;227(1):64–76.

Benjamin S, Seaman M. Applying continuous quality improvement and human performance technology to primary health care in Bahrain. Health Care Superv. 1998;17(1):62–71.

Byabagambi J, Marks P, Megere H, Karamagi E, Byakika S, Opio A, et al. Improving the quality of voluntary medical male circumcision through use of the continuous quality improvement approach: a pilot in 30 PEPFAR-Supported sites in Uganda. PLoS ONE. 2015;10(7):e0133369.

Hogg S, Roe Y, Mills R. Implementing evidence-based continuous quality improvement strategies in an urban Aboriginal Community Controlled Health Service in South East Queensland: a best practice implementation pilot. JBI Database Syst Rev Implement Rep. 2017;15(1):178–87.

Hopper MB, Morgan S. Continuous quality improvement initiative for pressure ulcer prevention. J Wound Ostomy Cont Nurs. 2014;41(2):178–80.

Ji J, Jiang DD, Xu Z, Yang YQ, Qian KY, Zhang MX. Continuous quality improvement of nutrition management during radiotherapy in patients with nasopharyngeal carcinoma. Nurs Open. 2021;8(6):3261–70.

Chen M, Deng JH, Zhou FD, Wang M, Wang HY. Improving the management of anemia in hemodialysis patients by implementing the continuous quality improvement program. Blood Purif. 2006;24(3):282–6.

Reeves S, Matney K, Crane V. Continuous quality improvement as an ideal in hospital practice. Health Care Superv. 1995;13(4):1–12.

Barton AJ, Danek G, Johns P, Coons M. Improving patient outcomes through CQI: vascular access planning. J Nurs Care Qual. 1998;13(2):77–85.

Buttigieg SC, Gauci D, Dey P. Continuous quality improvement in a Maltese hospital using logical framework analysis. J Health Organ Manag. 2016;30(7):1026–46.

Take N, Byakika S, Tasei H, Yoshikawa T. The effect of 5S-continuous quality improvement-total quality management approach on staff motivation, patients’ waiting time and patient satisfaction with services at hospitals in Uganda. J Public Health Afr. 2015;6(1):486.


Jacobson GH, McCoin NS, Lescallette R, Russ S, Slovis CM. Kaizen: a method of process improvement in the emergency department. Acad Emerg Med. 2009;16(12):1341–9.

Agarwal S, Gallo J, Parashar A, Agarwal K, Ellis S, Khot U, et al. Impact of lean six sigma process improvement methodology on cardiac catheterization laboratory efficiency. Catheter Cardiovasc Interv. 2015;85:S119.

Rahul G, Samanta AK, Varaprasad G. A Lean Six Sigma approach to reduce overcrowding of patients and improve the discharge process in a super-specialty hospital. In: 2020 International Conference on System, Computation, Automation and Networking (ICSCAN); 2020 Jul 3. p. 1–6. IEEE.

Patel J, Nattabi B, Long R, Durey A, Naoum S, Kruger E, et al. The 5 C model: A proposed continuous quality improvement framework for volunteer dental services in remote Australian Aboriginal communities. Community Dent Oral Epidemiol. 2023;51(6):1150–8.

Van Acker B, McIntosh G, Gudes M. Continuous quality improvement techniques enhance HMO members’ immunization rates. J Healthc Qual. 1998;20(2):36–41.

Horine PD, Pohjala ED, Luecke RW. Healthcare financial managers and CQI. Healthc Financ Manage. 1993;47(9):34.

Reynolds JL. Reducing the frequency of episiotomies through a continuous quality improvement program. CMAJ. 1995;153(3):275–82.

Bunik M, Galloway K, Maughlin M, Hyman D. First five quality improvement program increases adherence and continuity with well-child care. Pediatr Qual Saf. 2021;6(6):e484.

Boyle TA, MacKinnon NJ, Mahaffey T, Duggan K, Dow N. Challenges of standardized continuous quality improvement programs in community pharmacies: the case of SafetyNET-Rx. Res Social Adm Pharm. 2012;8(6):499–508.

Price A, Schwartz R, Cohen J, Manson H, Scott F. Assessing continuous quality improvement in public health: adapting lessons from healthcare. Healthc Policy. 2017;12(3):34–49.

Gage AD, Gotsadze T, Seid E, Mutasa R, Friedman J. The influence of continuous quality improvement on healthcare quality: a mixed-methods study from Zimbabwe. Soc Sci Med. 2022;298:114831.

Chan YC, Ho SJ. Continuous quality improvement: a survey of American and Canadian healthcare executives. Hosp Health Serv Adm. 1997;42(4):525–44.

Balas EA, Puryear J, Mitchell JA, Barter B. How to structure clinical practice guidelines for continuous quality improvement? J Med Syst. 1994;18(5):289–97.

ElChamaa R, Seely AJE, Jeong D, Kitto S. Barriers and facilitators to the implementation and adoption of a continuous quality improvement program in surgery: a case study. J Contin Educ Health Prof. 2022;42(4):227–35.

Candas B, Jobin G, Dubé C, Tousignant M, Abdeljelil A, Grenier S, et al. Barriers and facilitators to implementing continuous quality improvement programs in colonoscopy services: a mixed methods systematic review. Endoscopy Int Open. 2016;4(2):E118–133.

Brandrud AS, Schreiner A, Hjortdahl P, Helljesen GS, Nyen B, Nelson EC. Three success factors for continual improvement in healthcare: an analysis of the reports of improvement team members. BMJ Qual Saf. 2011;20(3):251–9.

Lee S, Choi KS, Kang HY, Cho W, Chae YM. Assessing the factors influencing continuous quality improvement implementation: experience in Korean hospitals. Int J Qual Health Care. 2002;14(5):383–91.

Horwood C, Butler L, Barker P, Phakathi S, Haskins L, Grant M, et al. A continuous quality improvement intervention to improve the effectiveness of community health workers providing care to mothers and children: a cluster randomised controlled trial in South Africa. Hum Resour Health. 2017;15(1):39.

Hyrkäs K, Lehti K. Continuous quality improvement through team supervision supported by continuous self-monitoring of work and systematic patient feedback. J Nurs Manag. 2003;11(3):177–88.

Akdemir N, Peterson LN, Campbell CM, Scheele F. Evaluation of continuous quality improvement in accreditation for medical education. BMC Med Educ. 2020;20(Suppl 1):308.

Barzansky B, Hunt D, Moineau G, Ahn D, Lai CW, Humphrey H, et al. Continuous quality improvement in an accreditation system for undergraduate medical education: benefits and challenges. Med Teach. 2015;37(11):1032–8.

Gaylis F, Nasseri R, Salmasi A, Anderson C, Mohedin S, Prime R, et al. Implementing continuous quality improvement in an integrated community urology practice: lessons learned. Urology. 2021;153:139–46.

Gaga S, Mqoqi N, Chimatira R, Moko S, Igumbor JO. Continuous quality improvement in HIV and TB services at selected healthcare facilities in South Africa. South Afr J HIV Med. 2021;22(1):1202.

Wang F, Yao D. Application effect of continuous quality improvement measures on patient satisfaction and quality of life in gynecological nursing. Am J Transl Res. 2021;13(6):6391–8.

Lee SB, Lee LL, Yeung RS, Chan J. A continuous quality improvement project to reduce medication error in the emergency department. World J Emerg Med. 2013;4(3):179–82.

Chiang AA, Lee KC, Lee JC, Wei CH. Effectiveness of a continuous quality improvement program aiming to reduce unplanned extubation: a prospective study. Intensive Care Med. 1996;22(11):1269–71.

Chinnaiyan K, Al-Mallah M, Goraya T, Patel S, Kazerooni E, Poopat C, et al. Impact of a continuous quality improvement initiative on appropriate use of coronary CT angiography: results from a multicenter, statewide registry, the advanced cardiovascular imaging consortium (ACIC). J Cardiovasc Comput Tomogr. 2011;5(4):S29–30.

Gibson-Helm M, Rumbold A, Teede H, Ranasinha S, Bailie R, Boyle J. A continuous quality improvement initiative: improving the provision of pregnancy care for Aboriginal and Torres Strait Islander women. BJOG: Int J Obstet Gynecol. 2015;122:400–1.

Bennett IM, Coco A, Anderson J, Horst M, Gambler AS, Barr WB, et al. Improving maternal care with a continuous quality improvement strategy: a report from the interventions to minimize preterm and low birth weight infants through continuous improvement techniques (IMPLICIT) network. J Am Board Fam Med. 2009;22(4):380–6.

Krall SP, Iv CLR, Donahue L. Effect of continuous quality improvement methods on reducing triage to thrombolytic interval for Acute myocardial infarction. Acad Emerg Med. 1995;2(7):603–9.

Swanson TK, Eilers GM. Physician and staff acceptance of continuous quality improvement. Fam Med. 1994;26(9):583–6.

Yu Y, Zhou Y, Wang H, Zhou T, Li Q, Li T, et al. Impact of continuous quality improvement initiatives on clinical outcomes in peritoneal dialysis. Perit Dial Int. 2014;34(Suppl 2):S43–48.

Schiff GD, Goldfield NI. Deming meets Braverman: toward a progressive analysis of the continuous quality improvement paradigm. Int J Health Serv. 1994;24(4):655–73.

American Hospital Association Division of Quality Resources Chicago, IL: The role of hospital leadership in the continuous improvement of patient care quality. American Hospital Association. J Healthc Qual. 1992;14(5):8–14,22.

Scriven M. The Logic and Methodology of checklists [dissertation]. Western Michigan University; 2000.

Hales B, Terblanche M, Fowler R, Sibbald W. Development of medical checklists for improved quality of patient care. Int J Qual Health Care. 2008;20(1):22–30.

Vermeir P, Vandijck D, Degroote S, Peleman R, Verhaeghe R, Mortier E, et al. Communication in healthcare: a narrative review of the literature and practical recommendations. Int J Clin Pract. 2015;69(11):1257–67.

Eljiz K, Greenfield D, Hogden A, Taylor R, Siddiqui N, Agaliotis M, et al. Improving knowledge translation for increased engagement and impact in healthcare. BMJ open Qual. 2020;9(3):e000983.

O’Brien JL, Shortell SM, Hughes EF, Foster RW, Carman JM, Boerstler H, et al. An integrative model for organization-wide quality improvement: lessons from the field. Qual Manage Healthc. 1995;3(4):19–30.

Adily A, Girgis S, D’Este C, Matthews V, Ward JE. Syphilis testing performance in Aboriginal primary health care: exploring impact of continuous quality improvement over time. Aust J Prim Health. 2020;26(2):178–83.

Horwood C, Butler L, Barker P, Phakathi S, Haskins L, Grant M, et al. A continuous quality improvement intervention to improve the effectiveness of community health workers providing care to mothers and children: a cluster randomised controlled trial in South Africa. Hum Resour Health. 2017;15:1–11.

Veillard J, Cowling K, Bitton A, Ratcliffe H, Kimball M, Barkley S, et al. Better measurement for performance improvement in low- and middle-income countries: the primary Health Care Performance Initiative (PHCPI) experience of conceptual framework development and indicator selection. Milbank Q. 2017;95(4):836–83.

Barbazza E, Kringos D, Kruse I, Klazinga NS, Tello JE. Creating performance intelligence for primary health care strengthening in Europe. BMC Health Serv Res. 2019;19(1):1006.

Assefa Y, Hill PS, Gilks CF, Admassu M, Tesfaye D, Van Damme W. Primary health care contributions to universal health coverage. Ethiopia Bull World Health Organ. 2020;98(12):894.

Van Weel C, Kidd MR. Why strengthening primary health care is essential to achieving universal health coverage. CMAJ. 2018;190(15):E463–466.

Download references

Acknowledgements

Not applicable.

Funding

The authors received no funding.

Author information

Authors and Affiliations

School of Public Health, The University of Queensland, Brisbane, Australia

Aklilu Endalamaw, Resham B Khatri, Tesfaye Setegn Mengistu, Daniel Erku & Yibeltal Assefa

College of Medicine and Health Sciences, Bahir Dar University, Bahir Dar, Ethiopia

Aklilu Endalamaw & Tesfaye Setegn Mengistu

Health Social Science and Development Research Institute, Kathmandu, Nepal

Resham B Khatri

Centre for Applied Health Economics, School of Medicine, Griffith University, Brisbane, Australia

Daniel Erku

Menzies Health Institute Queensland, Griffith University, Brisbane, Australia

International Institute for Primary Health Care in Ethiopia, Addis Ababa, Ethiopia

Eskinder Wolka & Anteneh Zewdie


Contributions

AE conceptualized the study, developed the first draft of the manuscript, and managed feedback from co-authors. YA conceptualized the study, provided feedback, and supervised the whole process. RBK, TSM, DE, EW, and AZ provided feedback throughout. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Aklilu Endalamaw.

Ethics declarations

Ethics approval and consent to participate

Not applicable because this research is based on publicly available articles.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.

Supplementary Material 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Endalamaw, A., Khatri, R.B., Mengistu, T.S. et al. A scoping review of continuous quality improvement in healthcare system: conceptualization, models and tools, barriers and facilitators, and impact. BMC Health Serv Res 24, 487 (2024). https://doi.org/10.1186/s12913-024-10828-0


Received: 27 December 2023

Accepted: 05 March 2024

Published: 19 April 2024

DOI: https://doi.org/10.1186/s12913-024-10828-0


Keywords

  • Continuous quality improvement
  • Quality of care

BMC Health Services Research

ISSN: 1472-6963


Sink & Kalkbrenner Receive 2018 Outstanding Scholar Award


About the Authors

Dr. Chris Sink, professor, and Dr. Michael Kalkbrenner, doctoral alumnus, received the 2018 Outstanding Scholar Award for Quantitative or Qualitative Research for their article, "Development and Validation of the College Mental Health Perceived Competency Scale," published in the national, peer-reviewed journal The Professional Counselor.

Congratulations Drs. Sink & Kalkbrenner!

