Improving Performance of Hurdle Models using

Rare-Event Weighted Logistic Regression:

Application to Maternal Mortality Data

Sharon Awuor Okello

Submitted in partial fulfilment of the requirements for the Degree of

Master of Science in Statistical Sciences of Strathmore University

Institute of Mathematical Sciences

Strathmore University

Nairobi, Kenya

June 6, 2022

This thesis is available for Library use through open access on the understanding that it is copyright
material and that no quotation from the thesis may be published without proper acknowledgement.


Declaration
I declare that this work has not been previously submitted and approved for award of a degree

by this or any other University. To the best of my knowledge and belief, the thesis contains

no material previously published or written by another person except where due reference is

made in the thesis itself.

© No part of this thesis may be reproduced without the permission of the author and

Strathmore University.

Name: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Sharon Awuor Okello. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Signature: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Date:. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .August 23, 2022. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Approval

The thesis of Sharon Awuor Okello was reviewed and approved by the following:

Dr. Collins Ojwang’ Odhiambo

Supervisor,

Institute of Mathematical Sciences, Strathmore University.

Dr. Evans Otieno Omondi

Supervisor,

Institute of Mathematical Sciences, Strathmore University.

Dr. Godfrey Madigu

Dean,

Institute of Mathematical Sciences, Strathmore University.

Dr. Bernard Shibwabo

Director,

Office of Graduate Studies, Strathmore University.

ii


Abstract
Hurdle models, which are commonly used alongside zero-inflated models to analyze dis-

persed zero-inflated count data, employ a logit link function to predict whether an observation

takes a positive count or a zero count based on a set of covariates. However, the logit model

tends to be biased toward the majority zero class in cases involving rare events, and may

underestimate the positive counts when their proportion is significantly smaller than that

of the zero counts. This research aimed to improve the performance of hurdle models by

incorporating rare-event weighted logistic regression model. Poisson and Negative Binomial

(NB) Hurdle Rare Event Weighted Logistic Regression (REWLR) model estimates were

developed and fit on various simulation conditions and maternal mortality data for perfor-

mance evaluation using Akaike Information Criterion (AIC) and Area Under Curve (AUC).

The Negative Binomial Hurdle REWLR emerged to be the best performing among all the

evaluated models due to the ability to handle dispersion and adjust for class imbalance. The

research findings will provide reliable estimates of the maternal mortality ratio in Nairobi

without the risk of over-fitting zero counts.

iii


Table of contents

List of figures vii

List of tables viii

List of abbreviations ix

Acknowledgement x

Dedication xi

1 Introduction 1

1.1 Background to the study . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Objective of the study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.1 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.3.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.5 Significance of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Literature review 7

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.1 Hurdle Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.2 Zero-inflated Models . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.3 Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Maternal Mortality in Kenya . . . . . . . . . . . . . . . . . . . . . . . . . 11


2.4 Our Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Methodology 13

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2 Research Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3 Hurdle-REWLR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.3.1 Poisson Hurdle-REWLR Model . . . . . . . . . . . . . . . . . . . 16

3.3.2 Negative Binomial Hurdle-REWLR Model . . . . . . . . . . . . . 17

3.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.5 Maternal Mortality data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.6 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Results and Interpretation 22

4.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.2 Application to Maternal Deaths Data . . . . . . . . . . . . . . . . . . . . . 28

4.2.1 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.2.2 Maternal Death Models . . . . . . . . . . . . . . . . . . . . . . . . 29

5 Discussion, Conclusion and Recommendation 34

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.4 Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.4.1 Recommendation for further research . . . . . . . . . . . . . . . . 38

5.4.2 Policy recommendation . . . . . . . . . . . . . . . . . . . . . . . 38

References 39

Appendix A R CODES 42

A.1 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

A.2 Simulations and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

v


A.3 Analysis on Maternal Mortality Data . . . . . . . . . . . . . . . . . . . . . 55

A.3.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . 55

A.3.2 Count Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Appendix B Turnitin Report 62

Appendix C Ethics Review Approval 83

vi


List of figures

Figure 1.1: MMR Trends between 2000 - 2017: Source Organization et al. (2019) 4

Figure 4.1: AICs from models fit on Poisson Hurdle simulated data, n = 200 . . 24

Figure 4.2: AICs from models fit on Poisson Hurdle simulated data, n = 1000 . . 25

Figure 4.3: AICs from models fit on Poisson Hurdle-RE simulated data, n = 200 25

Figure 4.4: AICs from models fit on Poisson Hurdle-RE simulated data, n = 1000 26

Figure 4.5: AICs from models fit on NB Hurdle simulated data, n = 200 . . . . . 26

Figure 4.6: AICs from models fit on NB Hurdle simulated data, n = 1000 . . . . 27

Figure 4.7: AICs from models fit on NB Hurdle-RE simulated data, n = 200 . . . 27

Figure 4.8: AICs from models fit on NB Hurdle-RE simulated data, n = 1000 . . 28

Figure 4.9: Maternal Death Counts . . . . . . . . . . . . . . . . . . . . . . . . 29

Figure 4.10: ROC-AUC for the various models . . . . . . . . . . . . . . . . . . . 33

vii


List of tables

Table 3.1: Variable Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Table 4.1: AIC (Percentage Change in AIC) for Misspecified and Actual Models 23

Table 4.2: Average Count of Obstetric Conditions reported in facilities with and

without reported maternal deaths . . . . . . . . . . . . . . . . . . . 30

Table 4.3: Binary Component Coefficients . . . . . . . . . . . . . . . . . . . . 31

Table 4.4: Count Component Coefficients . . . . . . . . . . . . . . . . . . . . 31

Table 4.5: AIC for Maternal Mortality Models . . . . . . . . . . . . . . . . . . 32

Table 4.6: Observed and Expected Zero Counts . . . . . . . . . . . . . . . . . 32

viii


List of abbreviations
OLS Ordinary Least Squares WHO World Health Organization

MDSR Maternal Death Surveillance

and Response

MMR Maternal Mortality Ratio

SDG Sustainable Development

Goals

KNH Kenyatta National Hospital

ANC Antenatal Care REWLR Rare-Event Weighted Logistic Re-

gression

GLM Generalized Linear Model MNCH Maternal, Newborn and Child

Health

ix


Acknowledgement
I want to express my sincere gratitude to my academic supervisors, Dr Collins Odhiambo

and Dr Evans Omondi, for the valuable advice and support throughout the research period

and guidance during the thesis write-up. I am also grateful to the Strathmore Institute of

Mathematical Sciences and the faculty who have imparted knowledge and offered support

throughout this academic program.

x


Dedication
This thesis is dedicated to God for giving me the gift of life, knowledge and perseverance. To

my mum Prisca Atieno Muga for her love and support. To my beloved son Haris Hawi

Nyangi for whom I am motivated to be the best version of myself.

xi


Chapter 1

Introduction

1.1 Background to the study

Count data are generated by enumeration processes that produce discrete non-negative

numbers. Due to the heteroskedastic and skewed nature of these data, the standard OLS

models are not suitable for parameter estimations (Hutchinson and Holtman, 2005). Count

models provide a better fit. Poisson and Negative Binomial regression are the most commonly

used models for count data estimations. The Poisson model, considered the standard count

model, assumes that the sample variance and sample mean are equal, a condition referred to

as equidispersion. However, this is seldom the case. In practice, the sample variance is often

either greater than (overdispersed) or less than the mean (under-dispersed). Poisson models

provide a poor fit for such data.

Overdispersion in count data may arise due to several reasons, including the presence of

excess zero counts in the data (Hilbe, 2011). The negative binomial model offers a better

fit for overdispersed data but may also suffer overdispersion limitations. Overdispersion in

a negative binomial model could occur when the observed model variance is greater than

NB’s expected variance, at times, due to more zeros than the model can accommodate (Hilbe,

2014). Zero-inflated mixture models are better suitable for modelling count data with more

zeros than can be accounted for by the regular count models. These models - Zero-inflated

Poisson (ZIP), Zero-inflated Negative Binomial (ZINB), Poisson Hurdle (PH) and Negative

Binomial Hurdle (NBH) - propose separate data-generating processes for zero and positive

counts.

Zero-inflated models introduced by Lambert (1992) and Greene (1994) propose a mixture

distribution, where data is generated from Bernoulli and Poisson or Negative Binomial

1


processes. The most outstanding feature of the zero-inflated models, as explained by (Rose

et al., 2006) is the assumption of the existence of an at-risk group that can never experience an

event (structural zeros), and an at-risk group that may still not experience the event (sampling

zeros). Zeros can thus be estimated using a mixture of a binary distribution - which estimates

the probability of structural zeros, and a count model - which estimates all counts, including

zeros.

The general structure of the zero-inflated model is given by:

P(Yi = yi) =


πi +(1−πi)p(yi;λ |yi = 0) yi = 0

(1−πi) p(yi;λ ) yi > 0;

(1.1)

Where πi is the probability of being a structural zero, p(yi;λ ) is the probability mass function

of the count model, and p(yi;λ |yi = 0) is the probability mass function of a count model for

the count zero.

A typical example that would motivate the application of zero-inflated models is the case of

modelling the weekly number of cigarettes smoked by a group of people, following, say, a

policy implementation that aims to reduce cigarette smoking. Participants who respond that

they have smoked ‘zero’ cigarettes may either be non-smokers who cannot have any value

other than zero (structural zeros) or smokers who have reduced their weekly consumption to

zero (sampling zeros).

Hurdle models, also called zero-altered models, provide an alternative means of modelling

zero-inflated data. The distinctive feature between these models and the zero-inflated models

is that hurdle models assume the existence of a single structural source of zeros. The general

concept of the hurdle models is that a binomial probability model determines whether a count

response variable takes a zero or a positive number. If the response variable returns a positive

value, the ‘hurdle’ is crossed, and a zero-truncated model determines the magnitude of the

positive counts (Mullahy, 1986).

In a maternal mortality setting, as in this study, the number of zeros reported is often excessive.

From a perspective of the total number of live births, maternal deaths can be viewed as a

2


rare event. Based on WHO recommendation, data collected on maternal death through the

Maternal Death Surveillance and Response (MDSR) systems include zero-reporting where

weekly statistics are submitted even if no death has occurred (Smith et al., 2017). These are

reported as ‘zero’ deaths. However, the zero deaths reported aren’t distinguishable as from a

structural or sampling source. One can’t divide the population of women giving birth into a

risk and a not-at-risk group and be certain the not-at-risk group will only report zero cases

of death. Because of this, the current research focuses on Hurdle models for estimation of

maternal deaths.

The logistic regression model is vital in the formulations of zero-inflated mixture models. In

Hurdle models, either logistic or probit regression models are used to estimate the probability

of obtaining a positive count (Hilbe, 2014). However, logistic regression models show

limitations when predicting probabilities in imbalanced classes, e.g., prediction of zero

versus positive counts for the binary component of Hurdle models. In the case of maternal

deaths reported where the zero class may always be significantly larger than the non-zero

class, logistic regression will tend to underestimate the probability of crossing the ‘hurdle’.

Maternal death is the death of a woman while pregnant or within 42 days of pregnancy

termination, irrespective of the duration and site of the pregnancy, from any cause related

to or aggravated by the pregnancy or its management but not from accidental or incidental

causes (Organization et al., 2019). According to 2017 WHO estimates, MMR declined

by 38% globally between 2000 and 2017 from 342 deaths to 211 deaths per 100,000 live

births. Kenya reported an impressive 52% reduction in MMR from 708 to 342 for the same

period (Organization et al., 2019). This significant reduction in maternal death cases can

be attributed to policies and initiatives implemented by the Kenyan government, such as

Free Maternity Program, Beyond Zero, Linda Mama Campaign, among others. Despite all

the strides towards reducing the number of maternal deaths, MMR is still high in Kenya.

Reducing the number of maternal deaths remains a national priority (Mwangi et al., 2019).

The third SDG of the UN launched in 2015 aims for global MMR reduction to 70 or less,

or at most 140, by 2030. With the current pace of progress, Kenya may fall short of this

3


target despite the programs and initiatives in place. More research is needed to guide new

initiatives and support existing policies.

Figure 1.1: MMR Trends between 2000 - 2017: Source Organization et al. (2019)

Research into maternal mortality has involved establishing incidences, analyzing trends,

identifying factors that may influence maternal deaths. The research has also involved

developing and comparing models to determine the best-suited model for estimating and

predicting maternal deaths.

1.2 Statement of the Problem

Rare events in count data results in class imbalance, where the proportion of zero counts

is greater than that of positive counts. In such cases, the standard logistic regression may

not be optimal, Maalouf and Siddiqi (2014). Hence the need for better performing models

to estimate the probability of non-zero counts for Hurdle models. The current research

incorporates Rare-Event Weighted Logistic Regression (REWLR) in our Hurdle models’

binary component to improve the model performance.

4


1.3 Objective of the study

1.3.1 Objectives

• To investigate the performance of Hurdle-REWLR models using simulation analysis

and with the application to maternal mortality data.

• To assess the performances of Hurdle-REWLR models for various proportions of

zero-inflation.

1.3.2 Research Questions

i. Does incorporating REWLR in Hurdle models improve the performance of the models?

ii. Does the degree of class imbalance between zero and non-zero classes influence model

performance?

1.4 Justification

The goal of any statistical study is often to predict an outcome of interest or provide inference

about the same. Practical inferences about a population require the sample statistics and

estimates to be generalizable to a broader population, hence the need for models best

suited for the population from which a sample is drawn. Statisticians and researchers have

adjusted distributions and modified models to find the distribution that best explains their

data and models that offer the best fit. Subsequently, count data models were extended to

accommodate the excess zeros that arise in situations where the event of interest is rare.

Formulations of these zero-modified count models involved nesting logistic regression to

estimate the probabilities of a structural zero or a zero count for zero-inflated and hurdle

models, respectively. GLM literature suggests the logit function is symmetric, so the response

curve approaches zero and one at the same rate. This feature makes logistic regression

inefficient due to the risk of underestimating the probability of a rare event.

5


Maternal deaths in Kenya are already under-reported due to the inefficient data collection

systems and the high number of women who give birth outside healthcare facilities. Employ-

ing models that may underestimate the already under-reported maternal death cases would

harm maternal health policies, as it would offer a false sign of relief. Hence the need for

count estimation models adjusted to deal with rare events.

1.5 Significance of the Study

This study proposes a model extension to improve the performance of Hurdle models. The

model, applied to maternal mortality data, would ensure accurate estimation of the death

cases and eliminate any false relief caused by over-estimation of the zero-death cases.

6


Chapter 2

Literature review

2.1 Introduction

Variations of two statistical approaches have been used in modelling count data characterized

by excess zeros in the outcome variable. This chapter provides an overview of these statistical

approaches, the concepts behind them, and their applications in past research. We also review

the logistic regression model and its limitations in estimating probabilities of rare events.

2.2 Models

2.2.1 Hurdle Models

Mullahy (1986) developed the Poisson hurdle model to handle zero-inflated count data in

cases where sampling and structural zeros were not distinguishable. His proposed 2-part

model analyzed zero counts separately from positive counts. He applied the model to study

peoples’ daily consumption of beverages based on certain socio-demographic factors. The

study results revealed that the hurdle model allowed for more flexibility in model specification

than the basic model. The models proposed in this research could also account for both

under-dispersion and over-dispersion.

King (1989) separately developed hurdle models in an application to a political science study.

His research aimed to develop an approach that models the onset of war separately from its

escalation. The model was developed following Mullahy (1986) theory of data generation

mechanism, where certain factors determine whether a country goes or does not go into war,

7


and once a country crosses the hurdle, factors such as alliances will determine the number of

wars with which the country will be involved. This model proved to be an improvement of

Mullahy’s hurdle model.

Rose et al. (2006) applied the Poisson and Negative Binomial hurdle models in estimating the

number of adverse events reported for each subject following a vaccination injection. They

assumed a single source of zeros (sampling) because their study design made it such that all

subjects were at risk of experiencing at least one adverse event. This assumption favoured

hurdle over zero-inflated models. The goodness of fit statistics for ZINB and NBH were

indistinguishable. A quasi-experimental study by Chaudhari et al. (2012) utilized hurdle

models in the estimation of the total dental utilization using data obtained from dental claims.

The model allowed them to decompose the hurdle likelihood function to allow for individual

estimation of the probability of dental care, type of dental care and level of utilization. The

likelihood decomposing feature gave the Negative Binomial Hurdle model the edge over the

other models.

Hurdle models have also been widely applied in mortality estimation studies. Fenta and

Fenta (2020) determined NBH model over ZIP, ZINB and PH for estimating risk factors of

child mortality in Ethiopia. In a different study, NBH emerged to be the best statistical model

for estimating predictors of under-five mortality in Ethiopia. The hurdle model was also

selected as the best fitting model in the Mamun (2014) study to estimate under-five deaths.

Both pieces of research involved comparing the Hurdle models to the zero-inflated models

and, in some cases, the standard count models.

Besides the application of hurdle models in the various research fields, researchers have

also developed modified versions of the models to provide better fit for their data. One of

the hurdle model extensions was by Min and Agresti (2005), to accommodate correlated

data. The authors modified the Hurdle model to include a random effect for their research

to estimate the number of episodes of side effects recorded at each visit and compared two

treatments. Fitting the random effects hurdle models proved less complex than fitting a

zero-inflated random-effects model. In addition, the model provided more straightforward

8


interpretations. A two-part model meant the two parts could be fitted and estimated separately,

hence reducing complexity.

2.2.2 Zero-inflated Models

Since their introduction, zero-inflated models have been continually modified and applied

in many fields. Aryuyuen et al. (2014) developed the ZINB - Generalized Exponential

distribution to provide a better fit for heavy-tailed over-dispersed zero-inflated data. He

assessed the new model’s performance compared to ZIP and ZINB, applied on simulated data

and actual data for hospital stays by senior US residents. The resultant model proved to be a

better fit than the ZIP and ZINB distributions. Kibika (2020) developed the ZINB - Shanker

distribution by combining Zero-inflated Negative Binomial and Shanker distribution. The

goal of developing the new model was to allow greater flexibility by increasing randomness

in the ZINB probability distribution function. The model was used to model HIV cases

among infants exposed to HIV through breastfeeding, etc. Overall fit tests revealed ZINB to

offer the best fit. ZINB-Shanker distribution proved competitive for larger sample sizes.

Diop et al. (2021) proposed a modification to ZIP which involved the use of the quantile

function of the Generalized Extreme Value (GEV) distribution as a link function for zero-

inflated data with rare events. The approach was proposed to curb the drawbacks of logistic

regression when dealing with imbalanced data, where the probability of a rare event is

underestimated. Ali (2020) did a comparison study between ZIP, ZIP-GEV, ZIP-clog log and

ZIP-probit. The analysis results revealed the Zero-inflated Poisson with a GEV link function

to be the best performing model.

Zero-inflated models have also been widely utilized in maternal health studies. Arefaynie

et al. (2022) used ZIP regression in a study to determine the number of antenatal care and

associated factors in Ethiopia. Fitriani et al. (2019) and Loquiha et al. (2013) used ZINB

regression to model maternal mortality in Malang and Mozambique.

9


2.2.3 Logistic Regression Model

The logistic regression model is the most commonly used statistical model for classifying

binary data. It estimates the probability of a binary outcome, independently or dependent on

a set of predictors. It has been broadly utilized in many fields including healthcare Yego et al.

(2014), epidemiology Tolles and Meurer (2016), education Mason et al. (2018), economics

Jabeur (2017), etc. Hurdle models also employ logistic regression to assign the probability

that governs whether a count takes on a zero or a positive value. Models based on various

link functions, including logit, probit, log-log, clog-log, have been proposed for the binary

response estimation, but logistic regression remains the most popular Desjardins (2013). Its

convenient interpretation and implementation makes it an ideal method for modelling binary

response variable Ali (2020). Logistic regression, however, has drawbacks when applied to

the classification of imbalanced binary events.

Research has highlighted this limiting feature of the logistic regression and proposed solutions

to account for class imbalance in binary data. Rahim et al. (2019) applied Synthetic Minority

Over-sampling Technique (SMOTE) sampling to Logistic regression, intending to improve

its classification accuracy in bankruptcy detection. The study results showed that the SMOTE

logistic regression outperformed the standard logistic regression with imbalanced data.

In his study, Wang (2020) investigated the sampling-based interventions for imbalanced

binary classes. The two approaches considered were undersampling the majority class or

oversampling the minority class. The results reveal that undersampling the majority class did

not always penalize the estimations, and oversampling the minority class did not consistently

reduce estimation efficiency.

King and Zeng (2001) proposed a different approach for dealing with imbalanced binary

classes, which involved applying weights and prior correction in the estimation of probabili-

ties and regression coefficients. Their study results showed that the models implementing

the recommended corrections outperformed the existing standard methods. However, the

study’s recommended approach turned out to be over-correcting bias in Maximum Likelihood

Estimations.

10


Maalouf and Siddiqi (2014) developed the Rare Event Weighted Logistic Regression (REWLR)

for classifying large imbalanced data with a rare event. The proposed algorithm applied

weights and regularization terms to achieve better predictive accuracy, counter over-fitting

and reduce bias and variance. Weighted Logistic Regression Approach for Rare Events was

used in a study by Zare et al. (2013) to determine risk factors for female breast cancer where

the choice of REWLR over Logistic Regression was influenced by the rarity of events of

interest in their research. In a comparison study, REWLR proved to perform better than

other algorithms, including the Truncated-Regularized Iteratively Re-weighted Least Squares

algorithm and Truncated-regularized Prior Correction Maalouf et al. (2018). The authors rec-

ommended the application of appropriate corrections and adjustments to Logistic Regression

when data is imbalanced.

2.3 Maternal Mortality in Kenya

Past studies on maternal mortality have aimed at establishing incidences, analyzing trends,

identifying factors that may influence maternal deaths or specific causes of death with the goal

of reducing cases of maternal deaths. One of those studies by Nyaboga (2009) described the

trends, magnitude, contributing factors and causes of maternal mortality in Kenya’s national

referral hospital, KNH. His research identified age, parity, place of delivery, contraceptive use,

ANC attendance, and socioeconomic status as the influential factors for maternal mortality in

the national referral hospital. The specific maternal death causes were outlined in his paper

as HIV, abortion complications, eclampsia, sepsis and postpartum haemorrhage.

Recommendations based on the research were in line with implementing BEmONC or

CEmONC interventions in all healthcare facilities. Emergency Obstetric and Newborn Care

(EmONC) describes a set of interventions that treat leading causes of perinatal and maternal

mortality (Tecla et al., 2017). Basic EmONC (BEmONC) services include: administration

of antibiotics to counter sepsis, anticonvulsants for hypertension disorders, uterotonic for

postpartum haemorrhage, Manual placenta removal, Assisted vaginal delivery, retained

products of conception extraction and neonatal resuscitation. Comprehensive EmONC

11


(CEmONC) services include all BEmONC components in addition to Caesarean section

surgical capability and blood transfusions (Odhiambo and Kinoti, 2019).

2.4 Our Research

From the reviewed research, the question that remains to be explored is how a modified

logistic regression would affect the performance of hurdle models. To the best of the author’s

knowledge, no study has attempted to improve the predictive performance of hurdle models’

binary component by accommodating class imbalance. Based on the problem we have

described so far, the objective of the current research is to improve the performance of

hurdle models nested with rare-event weighted logistic regression when applied to maternal

mortality data.

2.5 Conclusion

This chapter highlighted past research involving zero-inflated data, which elucidated the

decision behind the choice model for the current research and the need for more robust

classification models. The researchers concurred that decision about whether to apply hurdle

models or zero-inflated models in modelling count data with excessive zeros should be guided

by the beliefs about the data-generating mechanism of the zeros Min and Agresti (2005);

Miller (2007); Desjardins (2013). Rose et al. (2006) proposed hurdle models be considered

if there is a chance of zero deflation in the data.

The previous research work also supported the need for better-performing classification

techniques in the hurdle model’s binary component for data with rare events.

12


Chapter 3

Methodology

3.1 Introduction

This chapter details the development of the modified hurdle model which is achieved by

incorporating REWLR for binary component estimations. The model is applied to simulated

data to assess model performance with various proportions of zero counts and a real dataset

to assess factors that influence maternal mortality in Nairobi.

3.2 Research Design

This study aims to improve the performance of hurdle models, and assess the effects of select

obstetric and demographic factors on the number of maternal deaths in Nairobi. The maternal

mortality data utilized for this study was pulled from JPHES, a portal of District Health

Information Software (DHIS2), that streamlines health data reporting. The data contains the

number of maternal deaths and other obstetric and demographic factors recorded in MNCH

facilities in Nairobi between October 2021 and January 2022.

The study also introduces a modified Hurdle model that is based on Mullahy (1986)’s Hurdle

models, and Maalouf and Siddiqi (2014)’s Rare-Events Weighted Logistic Regression model.

The general structure of a hurdle model as proposed by Mullahy (1986) is given by:

P(Yi = yi) =


(1− pi) yi = 0

(pi)
p(yi;λi)

1−p(yi;λi|yi=0) yi > 0;

(3.1)

13


This is the two-part model which uses a logistic regression model to estimate pi and a

zero-truncated count model for the estimation of the zero-truncated count model.

logit(pi) = x1iβ1 and log(λi) = x2iβ2 (3.2)

We obtain the zero-truncated model by excluding the probability that yi = 0 from the count

distribution, which is achieved by dividing the probability mass function of the count model

by 1 minus the probability of a zero count i.e.,

p(yi;λi)

1− p(yi;λi|yi = 0)
. (3.3)

The probability pi of a positive count in hurdle models is typically modeled using a logistic

regression model, presented as:

pi =
eXβi

1+eXβi
= 1

1+e−Xβi
(3.4)

where βi’s are the vector of coefficients, and X is a vector of predictors.

We use MLE to find the parameter estimates of the hurdle model; this is obtained by

separately maximizing the log-likelihood functions of the binary and the zero-truncated

distributions.

The log-likelihood function of the Hurdle model using a logistic regression model for the

binary component is given by:

14


ℓ(β1,β2) = ln
n

∏
i=1

[
(pi)

yi (1− pi)
(1−yi)× p(yi; µi)

1− p(yi; µi|yi = 0)

]
=

n

∑
i=1

[
yi ln pi +(1− yi) ln(1− pi)+ ln

p(yi; µi)

1− p(yi; µi|yi = 0)

]
=

n

∑
i=1

ln1− pi +
n

∑
i=1

yi ln
pi

1− pi
+

n

∑
i=1

ln
p(yi; µi)

1− p(yi; µi|yi = 0)

=
n

∑
i=1

ln1− pi +
n

∑
i=1

yi (xβ )+
n

∑
i=1

ln
p(yi; µi)

1− p(yi; µi|yi = 0)

=
n

∑
i=1

−
(

ln1+ exβ

)
+

n

∑
i=1

yi (xβ )+
n

∑
i=1

ln
p(yi; µi)

1− p(yi; µi|yi = 0)
(3.5)

The maximum likelihood estimate for the binary component is the mean of the y variable

from the n draws, i.e.

pi =
1
n

n

∑
i=1

yi (3.6)

There is no closed form solution to obtain the maximum likelihood estimates for the zero-

truncated component. MLE are therefore obtained by using IRLS method of Newton-Raphson

algorithm to solve the score equations.

3.3 Hurdle-REWLR Model

The proposed model overcomes logistic regression, and hence hurdle models’, weakness in

the case of imbalanced data by adopting regularization, weighting, and bias correction on

logistic regression’s log likelihood function.

The log-likelihood function of the REWLR model introduced by Maalouf and Siddiqi (2014)

is given by:

ℓ(β ) = In
n

∏
i=1

(pi)
w1yi (1− pi)

w0(1−yi)− λ

2
∥β∥2

=−w0

n

∑
i=1

(
ln1+ exβ

)
+(w1 −w0)

n

∑
i=1

yixβ − λ

2
∥β∥2 (3.7)

15


where:

i. ws are the weights applied to counter imbalance in the data, which penalize the

misclassification made by setting a higher class weight to the minority class (positive

counts) while reducing weight for the majority class (zeros).

w1 =
τ

ȳ
; w0 =

(1− τ)

(1− ȳ)
(3.8)

(a) τ is the proportion of (non-zero) events in the population

(b) ȳ is the proportion of (non-zero) events in the sample;

ii. λ

2 ∥β∥2 is a regularization term that introduces a penalty for large values of β hence

avoids overfitting.

The log-likelihood of the binary logistic component and the zero-truncated Poisson or NB

component are estimated separately and then combined for model fit assessments.

Neither the binary nor the zero-truncated components have closed form solutions for max-

imum likelihood estimation. MLEs are thus obtained by using IRLS method of Newton-

Raphson algorithm to solve REWLR and zero truncated Poisson or zero truncated NB score

equations.

3.3.1 Poisson Hurdle-REWLR Model

The Probability Mass Function of the Poisson Hurdle-REWLR Model is given by:

P(Yi = yi) =

 (1− pi) yi = 0

(pi)
e−λiλ

yi
i

(1−e−λi)yi!
yi > 0

(3.9)

16


Model estimates are obtained by maximizing the MLE function of the Poisson Hurdle-

REWLR distribution:

ℓ(β1,β2) = ln
n

∏
i=1

(
(pi)

w1yi (1− pi)
w0(1−yi)− λ

2
∥β∥2 +

e−λiλ
yi
i(

1− e−λi
)

yi!

)

=−w0

n

∑
i=1

(
ln1+ ex1iβ1

)
+(w1 −w0)

n

∑
i=1

yix1iβ1 −
λ

2
∥β∥2

+
n

∑
i=1

(
−λ + yix2iβ2 − ln(1− e−ex2iβ2

)− ln(yi!)
)

(3.10)

∂ℓ(β1,β2)

∂β1
=−w0

n

∑
i=1

(
0+ ex1iβ1

)
x1i +(w1 −w0)

n

∑
i=1

yix1i −0 = 0 (3.11)

∂ℓ(β1,β2)

∂β2
=

n

∑
i=1

(
−0+ yix2i − ex2iβ2(x2i)−0

)
= 0 (3.12)

β̂1 =
(w0 −w1)

nx1i
n ln(yi) (3.13)

β̂2 =
1

x2i
lnyi (3.14)

Since both components have no closed form solutions, MLEs are thus obtained by using

IRLS method of Newton-Raphson algorithm.

3.3.2 Negative Binomial Hurdle-REWLR Model

The Probability Mass Function of the Negative Binomial Hurdle-REWLR Model is given by:

P(Yi = yi) =


(1− pi) yi = 0

pi

1−
(

k
µi+k

)k
Γ(yi+k)
yi!Γ(k)

(
µi

µi+k

)yi
(

k
µi+k

)k
yi > 0;

(3.15)

where the dispersion parameter k is given by 1
α

.

17


Model estimates are obtained by maximizing the MLE function of the Negative Binomial

Hurdle-REWLR distribution:

ℓ(β1,β2) = ln
n

∏
i=1

(
(pi)

w1yi (1− pi)
w0(1−yi)− λ

2
∥β∥2

)
+

1

1−
(

1
1+αµi

)α−1

Γ
(
yi +α−1)

yi!Γ(α−1)

(
αµi

1+αµi

)yi
(

1
1+αµi

)α−1

=−w0

n

∑
i=1

(
ln1+ exβ

)
+(w1 −w0)

n

∑
i=1

yixβ − λ

2
∥β∥2

+
n

∑
i=1

(
lnΓ

(
yi +α

−1)− lnΓ(α−1)− lnyi!−
(
yi +a−1) ln(1+αµi)

+yi lnαµi − ln
[
1− (1+αµi)

−a−1
])

(3.16)

∂ℓ(β1,β2)

∂β1
=−w0

n

∑
i=1

(
0+ ex1iβ1

)
x1i +(w1 −w0)

n

∑
i=1

yix1i −0 = 0 (3.17)

∂ℓ(β1,β2)

∂β2|µ
=

n

∑
i=1

[
yi

µ(1+αµ)
− (1+αµ)α−1−1

(1+αµ)α−1 −1

]
= 0 (3.18)

∂ℓ(β1,β2)

∂β2|α
=

n

∑
i=1

y−1
i

∑
v=0

(
v

v+αv−1

)
+

yi

µ(1+αµ)
+ α−2(1+αµ)−1 log(1+αµ)

(1+αµ)α−1−1

= 0 (3.19)

Since both components have no closed form solutions, MLEs are thus obtained by using

IRLS method of Newton-Raphson algorithm.

3.4 Simulations

Data is simulated under PH and NBH distributions. Simulations are performed using a

combination of sample size and proportion of zeros observed.

The following experimental conditions are applied for the simulation study:

• Zero inflation - 50%, 60%, 75%, and 90%.

18


• Sample size - 200, 1000

• The dispersion parameter value is set at 3

The simulation analysis involves generating data from four different distributions: Negative

Binomial Hurdle, Poisson Hurdle-REWLR, and Negative Binomial Hurdle-REWLR. For

each model, we generate 200 and 1000 random samples with varying zero proportions from

the true model, and then all the models are fit to the simulated datasets. In addition, a

predictor variable is simulated from the Poisson distribution, with a constant mean of 3

across all simulation conditions. The simulated covariate mimics the type of covariates for

the real maternal mortality data, e.g., number of pregnant women attending at least 4 ANC

visits, number of assisted Vaginal deliveries, etc. The dispersion parameter k is used with a

pre-stipulated value of 3.

We use AIC to compare the true and misspecified models in terms of the percentage of

the differences in the AICs for the misspecified models and true model (%∆AIC). Where

∆AIC = AIC(Misspecified model)−AIC(True model). We also compute AUC statistics to

achieve and compare an aggregate measure of performance across all possible classification

thresholds, for the two binary components.

Data is generated in R using the rpois(), rbinom(), rhpois(), rhnbinom() functions from the

stats, actuar and countreg packages. Analyses for the simulated and real data is performed

in R using hurdle() from the pscl package, glm from stats package and vglm() from V GAM

package.

3.5 Maternal Mortality data

The data contains information on obstetric outcomes, including maternal deaths for public

and private facilities in Nairobi that offer MNCH services. It covers the duration of October

2021 to January 2022, containing records for 222 MNCH facilities.

Data is available for at least one facility in all the 17 sub-counties in Nairobi: Westlands,

Dagoretti North, Dagoretti South, Langata, Kibra, Roysambu, Kasarani, Ruaraka, Embakasi

19


South, Embakasi North, Embakasi Central, Embakasi East, Embakasi West, Makadara,

Kamukunji, Starehe and Mathare. Nairobi is a cosmopolitan county and Kenya’s capital city

hence the data offers a good representation of the Kenyan population.

Table 3.1: Variable Definition

Variable Description

MaternalDeaths Number of Maternal deaths in MNCH Nairobi facilities

between October 2021 - January 2022

AssistedDeliveries Number of women who had assisted vaginal deliveries

BreechDelivery Number of women who had breech delivery

CS Number of women who gave birth by caesarian sections

LiveBirths Number of live births

EarlyTeenPreg Number of adolescents (10-14 years) pregnant at 1st ANC visit

LateTeenPreg Number of adolescents (15-19 years) pregnant at 1st ANC visit

NormalDeliveries Number of women who had normal deliveries

ANC4Visits Number of women who have attended at least 4 ANC visits

Uterotonics3stg Number of women giving birth who received uterotonics in the third

stage of (or immediately after birth)

Carbatosin Number of Mothers given uterotonics within 1 minute (Carbatosin)

Oxytocin Number of Mothers given uterotonics within 1 minute (Oxytocin)

AntHaemorrage Number of women who had Ante partum Haemorrage

PostHaemorrage Number of women who had Post Partum Haemorrage

ObstructedLabour Number of women who had Obstructed Labour

Eclampsia Number of women who had Eclampsia

RupturedUterus Number of women who had Ruptured Uterus

Sepsis Number of women who had sepsis

FGMComplicatons Number of Mothers with delivery complications associated with FGM

Stillbirth Number of women who had Macerated stillbirth

20


The response variable for this research is the number of maternal deaths reported in MNCH

Nairobi facilities between October 2021 - January 2022. The predictors consist of obstetric

factors, maternal complications, and demographic factors that previous literature suggested

influence maternal deaths.

3.6 Model selection

The study uses Akaike information criterion (AIC) to compare the model fit between the

modified hurdle models and the standard Hurdle models.

AIC is computed as:

AIC =−2log(L)+2K;

where L is the likelihood, and K is the number of parameters in the model.

AIC evaluates how well a model fits the data from which it was generated. The best-fitting

model yields the lowest AIC values.

Area under the curve (AUC), of the receiver operating characteristic (ROC) is also computed

for both binary components to show the performance of the classification models. AUC

measures the ability of the classification algorithm to distinguish between classes. Higher

values of AUC imply better model performance.

21


Chapter 4

Results and Interpretation

4.1 Simulation

In this section, we present the theoretical results of the study models through simulation

analyses performed on the Poisson Hurdle-REWLR, NB Hurdle-REWLR and Poisson Hurdle,

and NB Hurdle models.

For the simulation analysis, we first evaluated the performance of HNB, HNB-REWLR

and HP-REWLR models when the data are simulated from a Hurdle Poisson model. The

Hurdle-REWLR models reported the lowest AIC values among all the other models. The

percentage differences in the AICs between the misspecified models and the Poisson Hurdle

models increased as the proportions of zero in the data increased. This was the trend for both

the small (n=200) and large (n=1000) sample sizes.

Analysis on NB Hurdle data resulted in NB Hurdle REWLR outperforming NB Hurdle

at 60%, 75% and 90% zero inflation for both small and large sample sizes. NB Hurdle

outperformed the other models only when the data had a 50% zero inflation.

Data generated by the study model distributions, Poisson Hurdle-REWLR and NB Hurdle-

REWLR, performed best on the true models, based on AIC statistics. In the Poisson Hurdle

REWLR generated data, the lowest AIC values were recorded by the same model, for all

the simulation conditions. In a small sample size, the model performed best in data with

75% zero inflation while in large sample size, Poisson Hurdle-REWLR performance is best

in 60% inflation data. The least percentage change in AIC was achieved by the NB Hurdle

REWLR model. Models fit on the NB Hurdle-REWLR simulated data achieved the lowest

AIC. The least percentage change in AIC was achieved by the NB Hurdle model.

22


Table 4.1: AIC (Percentage Change in AIC) for Misspecified and Actual Models

Reference n Zeros PH PHRE NBH NBHRE
PH 200 0.50 418 402 ( -3.98) 420 ( 0.43) 403 ( -3.63)
PH 200 0.60 512 479 ( -6.88) 514 ( 0.3) 480 ( -6.64)
PH 200 0.75 519 478 ( -8.68) 521 ( 0.37) 480 ( -8.16)
PH 200 0.90 582 527 ( -10.43) 584 ( 0.42) 529 ( -9.92)
PH 1000 0.50 2252 2136 ( -5.44) 2254 ( 0.09) 2140 ( -5.23)
PH 1000 0.60 2473 2302 ( -7.43) 2474 ( 0.04) 2303 ( -7.36)
PH 1000 0.75 2730 2513 ( -8.65) 2732 ( 0.07) 2515 ( -8.56)
PH 1000 0.90 2855 2592 ( -10.17) 2857 ( 0.07) 2593 ( -10.1)
PHRE 200 0.50 578 ( 9.53) 523 580 ( 9.84) 525 ( 0.36)
PHRE 200 0.60 599 ( 9.13) 544 600 ( 9.29) 545 ( 0.26)
PHRE 200 0.75 562 ( 10.03) 506 564 ( 10.35) 509 ( 0.67)
PHRE 200 0.90 587 ( 9.61) 531 589 ( 9.92) 534 ( 0.5)
PHRE 1000 0.50 2807 ( 9.61) 2537 2809 ( 9.68) 2542 ( 0.18)
PHRE 1000 0.60 2952 ( 9.4) 2675 2954 ( 9.44) 2676 ( 0.06)
PHRE 1000 0.75 2976 ( 9.32) 2699 2978 ( 9.38) 2701 ( 0.07)
PHRE 1000 0.90 2921 ( 9.5) 2644 2923 ( 9.54) 2645 ( 0.04)
NBH 200 0.50 374 ( 7.56) 380 ( 8.98) 346 352 ( 1.61)
NBH 200 0.60 499 ( 5.84) 488 ( 3.76) 470 460 ( -2.27)
NBH 200 0.75 562 ( 7.87) 540 ( 4.11) 518 496 ( -4.47)
NBH 200 0.90 649 ( 6.26) 617 ( 1.39) 608 576 ( -5.57)
NBH 1000 0.50 2054 ( 5.9) 2069 ( 6.56) 1933 1948 ( 0.76)
NBH 1000 0.60 2161 ( 5.13) 2144 ( 4.37) 2050 2033 ( -0.85)
NBH 1000 0.75 2864 ( 8.81) 2767 ( 5.61) 2612 2514 ( -3.88)
NBH 1000 0.90 3128 ( 9.65) 2984 ( 5.29) 2826 2682 ( -5.38)
NBHRE 200 0.50 614 ( 12.66) 588 ( 8.78) 562 ( 4.58) 536
NBHRE 200 0.60 685 ( 11.4) 649 ( 6.45) 643 ( 5.58) 607
NBHRE 200 0.75 679 ( 14.28) 640 ( 9.12) 621 ( 6.26) 582
NBHRE 200 0.90 680 ( 12.82) 645 ( 8.02) 629 ( 5.69) 593
NBHRE 1000 0.50 3337 ( 12.58) 3180 ( 8.27) 3074 ( 5.11) 2917
NBHRE 1000 0.60 3044 ( 11.13) 2896 ( 6.59) 2853 ( 5.18) 2705
NBHRE 1000 0.75 3426 ( 14.11) 3245 ( 9.3) 3125 ( 5.81) 2943
NBHRE 1000 0.90 3279 ( 14.71) 3111 ( 10.08) 2966 ( 5.69) 2797

23


Overall, NB Hurdle REWLR outperformed the Poisson Hurdle, Poisson Hurdle REWLR and

NB Hurdle models. Plots of the resulting AIC values are presented in figure 4.1 and figure

4.2 for Poisson Hurdle simulated data, figure 4.3 figure 4.4 for Poisson Hurdle REWLR

simulated data, figure 4.5 and figure 4.6 for NB Hurdle simulated data, figure 4.7 and figure

4.7 for NB Hurdle REWLR simulated data.

Figure 4.1: AICs from models fit on Poisson Hurdle simulated data, n = 200

24


Figure 4.2: AICs from models fit on Poisson Hurdle simulated data, n = 1000

Figure 4.3: AICs from models fit on Poisson Hurdle-RE simulated data, n = 200

25


Figure 4.4: AICs from models fit on Poisson Hurdle-RE simulated data, n = 1000

Figure 4.5: AICs from models fit on NB Hurdle simulated data, n = 200

26


Figure 4.6: AICs from models fit on NB Hurdle simulated data, n = 1000

Figure 4.7: AICs from models fit on NB Hurdle-RE simulated data, n = 200

27


Figure 4.8: AICs from models fit on NB Hurdle-RE simulated data, n = 1000

4.2 Application to Maternal Deaths Data

4.2.1 Descriptive Statistics

The study sample data reported 293 maternal deaths of the 53792 recorded live births. The

sample variance of 3.758 exceeds the sample mean of 1.32, indicating overdispersed data.

The data also exhibits zero inflation as 61.71% of the dependent variable counts are zero.

Table 4.2 below exhibits the average counts for some of the obstetric factors used as covariates

in this study. The facilities which reported maternal deaths had higher average counts of all

the conditions. For instance, the number of Mothers given uterotonics was higher in the group

that experienced maternal deaths. Stillbirth occurrence was also primarily associated with

maternal death. Correlation analysis between the maternal deaths and the predictors revealed

that Maternal deaths was highly correlated with BreechDelivery (r = 0.7342), Uterotonics3stg

(r = 0.9615), Oxytocin (r = 0.9615), Carbatosin (r = 0.9615), AntHaemorrage (r = 0.9743),

Eclampsia (r = 0.9742), ObstructedLabour (r = 0.9741), PostHaemorrage (r = 0.9741),

28


Figure 4.9: Maternal Death Counts

FGMComplicatons (r = 0.9742), RupturedUterus (r = 0.9744), Sepsis (r = 0.9740), and

Stillbirth (r = 0.9764).

4.2.2 Maternal Death Models

Prior to formulating the count models, we compute the weights for the binary Hurdle-REWLR

models. The weights penalize the misclassification made by setting a higher class weight to

the positive counts while reducing weight for the zero counts.

The weights are calculated as outlined by Maalouf and Siddiqi (2014):

29


Table 4.2: Average Count of Obstetric Conditions reported in facilities with and without
reported maternal deaths

Factor No Maternal Deaths Maternal Deaths
BreechDelivery 0.3 2.7
CS 15.0 174.7
LiveBirths 44.7 560.8
EarlyTeenPreg 2.3 22.9
LateTeenPreg 12.7 60.0
NormalDeliveries 36.5 404.2
ANC4Visits 42.7 272.8
Uterotonics3stg 96.2 1139.7
Carbatosin 9.5 112.9
Oxytocin 75.0 889.0
AntHaemorrage 7.5 94.8
Eclampsia 2.2 28.5
ObstructedLabour 0.7 10.4
PostHaemorrage 3.4 42.4
FGMComplicatons 1.5 18.8
RupturedUterus 1.7 21.8
Sepsis 0.4 6.1
Stillbirth 0.0 0.6

w1 =
τ

ȳ
; w0 =

(1− τ)

(1− ȳ)
(4.1)

We have 293 deaths for the 53792 live births from our sample data. The latest report by the

Kenya Ministry of Health on Health and Health-related SDGs revealed the latest deaths per

live birth ratio reported in Nairobi as 97.

Ȳ =
293

53792
= 0.0054 τ =

97
100000

= 0.00097

w1 =
0.00097
0.0054

= 0.1796 w0 =
(1−0.00097)
(1−0054)

= 1.0045

30


We use correlation analysis to select the predictors to use for the analysis. Predictors with

high correlations are more linearly dependent and thus have the same effect on the dependent

variables.

The factors that influence observing a maternal death in the facility are attending at least

4 ANC visits, antepartum haemorrhage, and receiving uterotonics during or immediately

after birth. Specific effects of these factors on the various models are presented in Table 4.3.

Upon observing a maternal death, the determinants of the actual number of maternal deaths

that a facility could report are the occurrence of macerated stillbirth, attending of at least 4

ANC visits, adolescent pregnancies, antepartum hemorrhage, breech deliveries, postpartum

hemorrhage, receiving Carbatosin and giving birth by cesarean section. The coefficients of

the count model component is presented in Table 4.4.

Table 4.3: Binary Component Coefficients

PH.BINARY NBH.BINARY PHRE.BINARY NBHRE.BINARY
(Intercept) -3.9561920 -3.9561920 8.2592952 8.2591187
Stillbirth 21.4971965 21.4971965 -0.0242153 -0.0529712
ANC4Visits -0.0021931 -0.0021931 0.0260967 0.0261088
LateTeenPreg -0.0072837 -0.0072837 0.0091975 0.0091929
AntHaemorrage 1.2112676 1.2112676 -4.6352784 -4.6355084
PostHaemorrage -0.1450389 -0.1450389 0.2713996 0.2708596
BreechDelivery 0.1438486 0.1438486 -0.0180133 -0.0183037
Carbatosin -0.0103415 -0.0103415 -0.1092431 -0.1002919
Uterotonics3stg -0.0785952 -0.0785952 0.3119054 0.3110578

Table 4.4: Count Component Coefficients

PH.COUNT NBH.COUNT PHRE.COUNT NBHRE.COUNT
(Intercept) -0.0475 -0.0475 0.6303 -0.0444
Stillbirth 0.9126 0.9127 -0.0051 0.9126
ANC4Visits 0.0011 0.0011 0.0386 0.0011
LateTeenPreg 0.0023 0.0023 0.0006 0.0023
AntHaemorrage -0.0307 -0.0307 0.0018 -0.0304
PostHaemorrage 0.0302 0.0302 NA 0.0305
BreechDelivery 0.0317 0.0317 NA 0.0318
Carbatosin 0.2087 0.2087 NA 0.2131
Uterotonics3stg -0.0196 -0.01961 NA -0.0201

31


Table 4.5 shows the resulting AICs following the fit of Poisson, Negative Binomial, Poisson

Hurdle, NB Hurdle, Poisson Hurdle REWLR, NB Hurdle REWLR models to the maternal

mortality data. NB Hurdle REWLR produced the lowest AIC, indicating a better fit than the

other count models.

Table 4.5: AIC for Maternal Mortality Models

Model AIC
Poisson 469.6684
PH 335.9051
PH-RE 370.6200
NB 473.5588
NB-H 337.9054
NBH-RE 284.1434

ROC and the corresponding AUC values were obtained as shown in Figure 4.10. The Hurdle

models employed logistic regression algorithm for classification in the binary component

while the Hurdle-REWLR used REWLR algorithm. Both models scored highly on AUC with

values close to 1, an indication of good model performance. The classification algorithm

introduced by the study’s models emerged the better performing algorithm, with higher AUC

scores.

It was also of interest to the study how the Hurdle-REWLR predicted zero counts compared

to their counterpart standard Hurdle models. From Table 4.6, we observe that Poisson Hurdle

and NB Hurdle models accurately predicted the observed number of zero counts in the

sample data. The predicted zero counts from the sample data were slightly less than that

observed in the sample data.

Table 4.6: Observed and Expected Zero Counts

Oberved Poisson PH PH-RE NB NB-H NBH-RE

137 122 137 102 126 137 102

32


Figure 4.10: ROC-AUC for the various models

33


Chapter 5

Discussion, Conclusion and

Recommendation

5.1 Introduction

This section presents an interpretation of the study’s research findings in relation to findings

made by previous researchers on the same topic. We further draw conclusions and make

recommendations based on our study outputs.

5.2 Discussion

Cases of rare events in count data where the proportion of zero counts is significantly less than

that of the natural numbers have been shown to influence binary estimations in zero-inflated

count models. Theoretically, more extreme rare events are expected to impose extreme bias

towards the majority group, i.e., zero counts. Because of this hypothesis, the current study

conducted extensive simulation and analysis with varying proportions of zeros and sample

sizes to evaluate the performance of the Hurdle-REWLR models in the various simulation

conditions. The study also evaluated the performance of the study models alongside the

standard hurdle models when fit on maternal mortality data to determine the factors which

influence maternal mortality in Nairobi, Kenya.

The analysis to determine factors which influence maternal deaths in Nairobi resulted in

NB Hurdle-REWLR outperforming the other models in terms of the Akaike Information

Criterion. The Hurdle-REWLR models adjusted for the population estimates by introducing

34


weights and regularizing the coefficients. The predicted zero counts from the sample data

emerged to be slightly less than observed. The number estimated by the Hurdle-REWLR

models could be expected from a sample that accurately represents the Nairobi population.

The introduction of the weights makes the Hurdle-REWLR models ideal for estimations and

inference.

In both the Hurdle and hurdle-REWLR models, specific demographic and obstetric factors

significantly affected the response variable. Age, described by the number of pregnant ado-

lescents, and attendance of at least 4 ANC visits are some of the demographic factors shown

by past literature to influence maternal deaths. In addition, childbirth-related conditions of

postpartum haemorrhage, treated by uterotonics, antepartum haemorrhage, breech delivery

and Macerated stillbirth were also discovered to influence the number of maternal deaths.

Haemorrhage is among the obstetric factors which were highlighted by (Organization et al.,

2019) to be some of the causes of maternal death globally. These findings were also in line

with Nyaboga (2009) research which outlined the influential factors and causes of maternal

mortality in Kenya’s national referral hospital, KNH. Some of the factors identified in their

research which the current study has outlined, include age, ANC attendance, and Postpartum

haemorrhage.

Simulation analysis findings revealed NB Hurdle-REWLR to produce the lowest AIC value

compared to the other models. The percentage difference in AICs between NB Hurdle-

REWLR and the other misspecified models increased as the zeros in the data increased. The

Poisson Hurdle-REWLR model outperformed the NB Hurdle in Poisson Hurdle REWLR

simulated data but was inferior in NB Hurdle simulated data. It could not account for the extra

dispersion introduced in the NB simulated data. The Hurdle-REWLR models, through their

binary component, also outperformed the standard Hurdle models, in terms of ROC-AUC

statistics. The classification algorithm used for the Hurdle-REWLR performed better at

classifying imbalanced data.

The selection of NB Hurdle REWLR as the ideal model over the standard NB Hurdle model

was influenced by the degree of zero inflation in the simulation analysis. The two models

gave almost similar results when the proportions of zeros and non-zeros in the scenarios

35


where the data were not significantly different. For instance, in the NB Hurdle simulated

data, the model performed better than the NB Hurdle REWLR model at 50% zero inflation

but was outperformed for the subsequent degrees of zero inflation of 60%, 75% and 90%.

This outcome conformed to the basic concept of the Hurdle-REWLR model. As outlined

by Maalouf and Siddiqi (2014), REWLR is modified from logistic regression with the aim

of unbiased prediction in rare events with imbalanced data. If the proportions of zero and

non-zero counts are balanced, REWLR is not expected to outperform logistic regression.

Despite their foundations on similar concepts, the performance of the Poisson Hurdle REWLR

was inferior to that of the NB Hurdle REWLR model. In the simulation analysis, Poisson

Hurdle REWLR outperformed its counterpart in data generated by its distribution, and the

Poisson Hurdle simulated data only by small units of percentage change in AIC. In the

other simulation scenarios, the NB Hurdle REWLR model claimed superiority by quite huge

margins of the percentage change in AIC. In addition, NB Hurdle REWLR was the best

performing model for the fit on maternal mortality data. Such results have been witnessed in

various performance comparison studies including Fenta et al. (2020) and Mamun (2014).

It is common for the NB model to outperform its Poisson counterpart when there is some

dispersion in the data.

In the evaluation to assess how the Hurdle-REWLR predicted zero counts compared to the

Hurdle models, the hurdle models predicted the exact number of zeros available in the sample

data. The binary component of the Hurdle models uses logistic regression to predict the zero

counts. The prediction accuracy can thus be attributed to the bias towards the majority class.

Rahim et al. (2019) assessed the performance of SMOTE logistic as a classifier in rare events

data and revealed a similar outcome where SMOTE logistic regression approach was more

accurate compared to the logistic regression model but was outperformed by the latter in test

prediction accuracy.

The Negative Binomial hurdle REWLR model was selected based on the Akaike information

criterion. The model was then fit to the maternal mortality data. The covariate factors that

were significantly associated with maternal deaths at the binary level include attendance of at

least 4 ANC visits, antepartum haemorrhage, and receiving uterotonics during or immediately

36


after birth. Upon observing maternal death within a facility, the covariate factors influencing

the number of maternal deaths reported are Macerated stillbirth, attendance of at least 4

ANC visits, adolescent pregnancies, antepartum haemorrhage, breech deliveries, postpartum

haemorrhage, receiving Carbatosin and giving birth by cesarean section.

The Hurdle-REWLR model has an advantage over the Hurdle models because of their

ability to introduce weights hence producing more accurate estimates that can be used for

inference of population parameters. When the zero-inflated sample accurately represents

the population, choosing between these two groups of models could be based on Akaike

Information Criterion.

5.3 Conclusion

The main aim of this study was to create Poisson and NB Hurdle-REWLR models for zero-

inflated data and evaluate their performance in comparison to the standard Hurdle models.

The Hurdle-REWLR in their binary component accounted for an imbalance between majority

and minority proportions. That was the differentiating factor between the two models.

The proposed study models were then applied to simulated and maternal mortality data,

where NB Hurdle-REWLR outperformed the other models. The difference in AIC based

performance between the NB Hurdle REWLR model and the other models increased with an

increase in the degree of zero inflation. The ideal model performed better in cases of class

imbalance. The study findings also highlighted a case of biased classification. The binary

component of the Hurdle model, using logistic regression, classified all the observed zero

counts in the maternal mortality as zeroes. Despite the prediction being an exact fit, the NB

Hurdle model was inferior in AIC measures. NB Hurdle REWLR was thus selected as the

ideal model in rare event cases where class imbalance exists.

The study further outlined factors influencing maternal deaths in Nairobi: adolescent preg-

nancy, attendance of at least 4 ANC visits, postpartum haemorrhage, antepartum haemor-

37


rhage, breech delivery, and Macerated stillbirth. Most of these factors have been identified as

determinants or causes of maternal deaths literature reviewed by this study.

Findings from this research are expected to provide reliable estimates of the number of

maternal deaths in Nairobi, Kenya. Without the risk of overfitting zero counts, researchers

will be able to realize the actual maternal mortality ratio and the factors associated with zero

maternal death counts. The research results will assist in supporting existing policies and

developing new programs and interventions to reduce the number of deaths due to childbirth

and maternity.

5.4 Recommendation

5.4.1 Recommendation for further research

One area for further research is the implementation of the Hurdle-REWLR models on

normally distributed covariates. The covariates of the current study data consisted of count

data, majority being zero-inflated just as the dependent variable; this limited the covariate

effect on the dependent variable.

5.4.2 Policy recommendation

This study recommends that the proposed interventions be implemented to halt any avoidable

deaths of women during and immediately after childbirth. These interventions, such as

the implementation of BEmONC or CEmONC has yet to be rolled out in all healthcare

facilities. Actualizing this would go a long way in preventing maternal deaths due to obstetric

conditions. Maternal deaths due to demographic and social factors such as adolescent

pregnancies and attendance of ANC visits can be countered by educating the public on all

the associated risks of these practices or lack-off.

38


References
Ali, E. (2020). Zero-inflated poisson regression model for a new class of flexible link

functions: A case study on healthcare utilization.

Arefaynie, M., Kefale, B., Yalew, M., Adane, B., Dewau, R., and Damtie, Y. (2022). Number
of antenatal care utilization and associated factors among pregnant women in ethiopia:
zero-inflated poisson regression of 2019 intermediate ethiopian demography health survey.
Reproductive Health, 19(1):1–10.

Aryuyuen, S., Bodhisuwan, W., and Supapakorn, T. (2014). Zero inflated negative binomial-
generalized exponential distribution and its applications. Songklanakarin Journal of
Science and Technology, 36(4):483–491.

Chaudhari, M., Hubbard, R., Reid, R. J., Inge, R., Newton, K. M., Spangler, L., and Barlow,
W. E. (2012). Evaluating components of dental care utilization among adults with diabetes
and matched controls via hurdle models. BMC oral health, 12(1):1–12.

Desjardins, C. D. (2013). Evaluating the performance of two competing models of school
suspension under simulation-the zero-inflated negative binomial and the negative binomial
hurdle. University of Minnesota.

Diop, A., Deme, E. H., and Diop, A. (2021). Zero-inflated generalized extreme value
regression model for binary data and application in health study. arXiv preprint
arXiv:2105.00482.

Fenta, S. M. and Fenta, H. M. (2020). Risk factors of child mortality in ethiopia: application
of multilevel two-part model. PLoS One, 15(8):e0237640.

Fenta, S. M., Fenta, H. M., and Ayenew, G. M. (2020). The best statistical model to estimate
predictors of under-five mortality in ethiopia. Journal of Big Data, 7(1):1–14.

Fitriani, R., Chrisdiana, L. N., and Efendi, A. (2019). Simulation on the zero inflated negative
binomial (zinb) to model overdispersed, poisson distributed data. In IOP Conference
Series: Materials Science and Engineering, volume 546, page 052025. IOP Publishing.

Greene, W. H. (1994). Accounting for excess zeros and sample selection in poisson and
negative binomial regression models.

Hilbe, J. M. (2011). Negative binomial regression. Cambridge University Press.

Hilbe, J. M. (2014). Modeling count data. Cambridge University Press.

Hutchinson, M. K. and Holtman, M. C. (2005). Analysis of count data using poisson
regression. Research in nursing & health, 28(5):408–418.

Jabeur, S. B. (2017). Bankruptcy prediction using partial least squares logistic regression.
Journal of Retailing and Consumer Services, 36:197–202.

Kibika, S. A. (2020). The Zero Inflated Negative Binomial-Shanker distribution and its
application to HIV exposed infant data. PhD thesis, Strathmore University.

39


King, G. (1989). Event count models for international relations: Generalizations and
applications. International Studies Quarterly, 33(2):123–147.

King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political analysis,
9(2):137–163.

Lambert, D. (1992). Zero-inflated poisson with an regression, in manufacturing to defects
application. Technometrics, 34(1):14.

Loquiha, O., Hens, N., Chavane, L., Temmerman, M., and Aerts, M. (2013). Modeling het-
erogeneity for count data: A study of maternal mortality in health facilities in mozambique.
Biometrical Journal, 55(5):647–660.

Maalouf, M., Homouz, D., and Trafalis, T. B. (2018). Logistic regression in large rare
events and imbalanced data: A performance comparison of prior correction and weighting
methods. Computational Intelligence, 34(1):161–174.

Maalouf, M. and Siddiqi, M. (2014). Weighted logistic regression for large-scale imbalanced
and rare events data. Knowledge-Based Systems, 59:142–148.

Mamun, M. A. A. (2014). Zero-inflated regression models for count data: an application to
under-5 deaths.

Mason, C., Twomey, J., Wright, D., and Whitman, L. (2018). Predicting engineering
student attrition risk using a probabilistic neural network and comparing results with a
backpropagation neural network and logistic regression. Research in Higher Education,
59(3):382–400.

McDowell, A. (2003). From the help desk: hurdle models. The Stata Journal, 3(2):178–184.

Miller, J. M. (2007). Comparing Poisson, Hurdle, and ZIP model fit under varying degrees
of skew and zero-inflation. PhD thesis, University of Florida.

Min, Y. and Agresti, A. (2005). Random effect models for repeated measures of zero-inflated
count data. Statistical modelling, 5(1):1–19.

Mullahy, J. (1986). Specification and testing of some modified count data models. Journal
of econometrics, 33(3):341–365.

Mwangi, A., Nangami, M., Tabu, J., Ayuku, D., Were, E., and Fabian, E. (2019). A system
approach to improving maternal and child health care delivery in kenyan communities
and primary care facilities: baseline survey on maternal health. African Health Sciences,
19(2):1841–1848.

Neelon, B., Chang, H. H., Ling, Q., and Hastings, N. S. (2016). Spatiotemporal hurdle
models for zero-inflated count data: exploring trends in emergency department visits.
Statistical methods in medical research, 25(6):2558–2576.

Nekesa, F. V. (2019). Distributions of zero-inflated models with application to HIV exposed
infants. PhD thesis, Strathmore University.

Nusinovici, S., Tham, Y. C., Yan, M. Y. C., Ting, D. S. W., Li, J., Sabanayagam, C., Wong,
T. Y., and Cheng, C.-Y. (2020). Logistic regression was as good as machine learning for
predicting major chronic diseases. Journal of clinical epidemiology, 122:56–69.

40


Nyaboga, E. O. (2009). Maternal mortality at Kenyatta National’hospital (Nairobi, Kenya)
2000-2008. PhD thesis.

Odhiambo, C. and Kinoti, F. (2019). Evaluation and comparison of patterns of maternal
complications using generalized linear models of count data time series. International
Journal of Statistics in Medical Research, 8:32–39.

Organization, W. H. et al. (2019). Trends in maternal mortality 2000 to 2017: estimates by
who, unicef, unfpa, world bank group and the united nations population division.

Rahim, A. H. A., Rashid, N. A., Nayan, A., and Ahmad, A.-R. (2019). Smote approach to
imbalanced dataset in logistic regression analysis. In Proceedings of the Third Interna-
tional Conference on Computing, Mathematics and Statistics (iCMS2017), pages 429–433.
Springer.

Rose, C. E., Martin, S. W., Wannemuehler, K. A., and Plikaytis, B. D. (2006). On the use of
zero-inflated and hurdle models for modeling vaccine adverse event count data. Journal of
biopharmaceutical statistics, 16(4):463–481.

Smith, H., Ameh, C., Godia, P., Maua, J., Bartilol, K., Amoth, P., Mathai, M., and van den
Broek, N. (2017). Implementing maternal death surveillance and response in kenya:
incremental progress and lessons learned. Global Health: Science and Practice, 5(3):345–
354.

Tecla, S. J., Franklin, B., David, A., and Jackson, T. K. (2017). Assessing facility readiness
to offer basic emergency obstetrics and neonatal care (bemonc) services in health care
facilities of west pokot county, kenya. J Clin Simul Res, 7:25–39.

Tolles, J. and Meurer, W. J. (2016). Logistic regression: relating patient characteristics to
outcomes. Jama, 316(5):533–534.

Wang, H. (2020). Logistic regression for massive data with rare events. In International
Conference on Machine Learning, pages 9829–9836. PMLR.

Yego, F., D’este, C., Byles, J., Williams, J. S., and Nyongesa, P. (2014). Risk factors for
maternal mortality in a tertiary hospital in kenya: a case control study. BMC pregnancy
and childbirth, 14(1):1–9.

Zare, N., Haem, E., Lankarani, K. B., Heydari, S. T., and Barooti, E. (2013). Breast cancer
risk factors in a defined population: weighted logistic regression approach for rare events.
Journal of breast cancer, 16(2):214–219.

Zhen, Z., Shao, L., and Zhang, L. (2018). Spatial hurdle models for predicting the number of
children with lead poisoning. International journal of environmental research and public
health, 15(9):1792.

Ziraba, A. K., Madise, N., Mills, S., Kyobutungi, C., and Ezeh, A. (2009). Maternal mortality
in the informal settlements of nairobi city: what do we know? Reproductive health,
6(1):1–8.

41


Appendix A

R CODES
The R code used for simulations and model fitting in Chapter 4.

A.1 Libraries

library(ggplot2)

library(sandwich)

library(msm)

library(dplyr)

library(tidyr)

library(vcd)

library(countreg)

library(pscl)

library(VGAM)

library(rewlr)

library(kableExtra)

library(readxl)

library(gridExtra)

library(glmnet)

library(plotrix)

library(ZIM)

library(tidyverse)

42


A.2 Simulations and Analysis

#Generate data from Poisson Hurdle distribution.

#Repeat for Poisson Hurdle-REWLR, NB Hurdle, and NB Hurdle-REWLR.

# Probability of 0 = 1-p

#Set the seed for reproducible results

set.seed(2345)

#Assigned weights

w1 = 3.5

w0 = 1

#Zero-altered poisson random number generator function

zero.aic.func <- function(n, pi, zero.prop) {

rhpois <- function(n=n, mu, zprob){

ifelse(rbinom(n, 1, zprob) == 1, 0, rpois(n, mu))

}

Y <- rhpois(n, mu = 1.3, zprob = pi) #Poisson Hurdle

X <- rpois(n, 3.5)

dsname <- data.frame(Y, X)

#Poisson Regression

model1 <- glm(Y ~ X, family="poisson", data=dsname)

aic1 <- summary(model1)$aic

#Poisson Hurdle Regression

model2 <- hurdle(Y ~ X, data=dsname, dist = "poisson", link="logit")

aic2 <- AIC(model2)

43


#REWLR-Hurdle Poisson Regression

model3.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = pospoisson(), data=dsname)

model3.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

aic.val3.a <- AICvlm(model3.a)

aic.val3.b <- model3.b$aic

#Negative Binomial Regression

model4 <- glm.nb(Y ~ X, data=dsname)

aic4 <- summary(model4)$aic

#Negative Binomial Hurdle Regression

model5 <- hurdle(Y ~ X, dist = "negbin")

aic5 <- AIC(model5)

#REWLR-Hurdle Negative Binomial Regression

model6.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = posnegbinomial(), data=dsname)

model6.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

aic.val6.a <- AICvlm(model6.a)

aic.val6.b <- model6.b$aic

aic6 <- aic.val6.a + aic.val6.b

#AIC Values based on arious zero-proportions

aic.values <- data.frame(AIC=rbind(aic1, aic2, aic3, aic4, aic5, aic6),

Model=c("Poisson", "PH", "PH-RE","NB", "NB-H", "NBH-RE"),

Zero.Proportion=c(rep(pi, 6)), row.names = NULL)

#Figure: Performance based on AIC values for sets of models, sample size, zero%

plot <- ggplot(aic.values, aes(x=Model, y=AIC))+

geom_bar(stat = "identity", fill="azure3")+

geom_text(aes(label=round(AIC, digits=0)), vjust=1.6, color="black", size=3.5)+

44


scale_x_discrete(limits=c("Poisson", "NB", "PH", "NB-H", "PH-RE", "NBH-RE"))+

labs(y = "AIC Value", x = "Count Model")+

ggtitle(zero.prop)+

theme_bw()

return(plot)

}

plot1 <- zero.aic.func(200, 0.50, "n=200; 50% zero-inflation")

plot2 <- zero.aic.func(200, 0.40, "n=200; 60% zero-inflation")

plot3 <- zero.aic.func(200, 0.25, "n=200; 75% zero-inflation")

plot4 <- zero.aic.func(200, 0.10, "n=200; 90% zero-inflation")

png(file="D:/MSc/Thesis/Thesis-template-20220215T185005Z-001/

Thesis-template/Figs/PH200.png", width=600, height=350)

grid.arrange(

plot1, plot2, plot3, plot4,

ncol=2, nrow = 2)

dev.off()

plot5 <- zero.aic.func(1000, 0.50, "n=1000; 50% zero-inflation")

plot6 <- zero.aic.func(1000, 0.40, "n=1000; 60% zero-inflation")

plot7 <- zero.aic.func(1000, 0.25, "n=1000; 75% zero-inflation")

plot8 <- zero.aic.func(1000, 0.10, "n=1000; 90% zero-inflation")

png(file="D:/MSc/Thesis/Thesis-template-20220215T185005Z-001/

Thesis-template/Figs/PH500.png", width=600, height=350)

grid.arrange(

plot5, plot6, plot7, plot8,

ncol=2, nrow = 2)

45


dev.off()

plot9 <- zero.aic.func(500, 0.50, "n=500; 50% zero-inflation")

plot10 <- zero.aic.func(500, 0.40, "n=500; 60% zero-inflation")

plot11 <- zero.aic.func(500, 0.25, "n=500; 75% zero-inflation")

plot12 <- zero.aic.func(500, 0.10, "n=500; 90% zero-inflation")

png(file="D:/MSc/Thesis/Thesis-template-20220215T185005Z-001/

Thesis-template/Figs/PH1000.png", width=600, height=350)

grid.arrange(

plot9, plot10, plot11, plot12,

ncol=2, nrow = 2)

dev.off()

#Compute Percentage change in AIC

aic.chg.func <- function(n, pi) {

# Zero-altered poisson random number generator

rhpois <- function(n=n, mu, zprob){

ifelse(rbinom(n, 1, zprob) == 1, 0, rpois(n, mu))

}

Y <- rhpois(n, mu = 1.3, zprob = pi) #Poisson Hurdle

X <- rpois(n, 3.5) # Independent variable X

dsname <- data.frame(Y, X)

#Poisson Regression

model1 <- glm(Y ~ X, family="poisson", data=dsname)

aic1 <- summary(model1)$aic

46


#Poisson Hurdle Regression

model2 <- hurdle(Y ~ X, data=dsname, dist = "poisson", link="logit")

aic2 <- AIC(model2)

## REWLR-Hurdle Poisson Regression

# Error due to vglm: https://bookdown.org/fxpalacio/bookdown_curso/GLM.html

model3.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = pospoisson(), data=dsname)

model3.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

aic.val3.a <- AICvlm(model3.a)

aic.val3.b <- model3.b$aic

aic3 <- aic.val3.a + aic.val3.b

#Negative Binomial Regression

model4 <- glm.nb(Y ~ X, data=dsname)

aic4 <- summary(model4)$aic

#Negative Binomial Hurdle Regression

model5 <- hurdle(Y ~ X, dist = "negbin")

aic5 <- AIC(model5)

## REWLR-Hurdle Negative Binomial Regression

model6.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = posnegbinomial(), data=dsname)

model6.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

aic.val6.a <- AICvlm(model6.a)

aic.val6.b <- model6.b$aic

aic6 <- aic.val6.a + aic.val6.b

#AIC Values based on various zero-proportions

aic.values.1 <- data.frame(ref="Poisson Hurdle (PH)", sample.size = n,

47


Zero.Proportion=1-pi, AIC=cbind(aic2, aic3, aic5, aic6))

colnames(aic.values.1) <- c("Reference","Sample size", "Zero Proportion",

"PH", "PHRE", "NBH", "NBHRE")

aic.values.1$PH <- round(aic.values.1$PH, 0)

aic.values.1$PHRE <- paste(round(aic.values.1$PHRE, 0), ’(’,

round(((aic.values.1$PHRE - aic.values.1$PH)/aic.values.1$PHRE)*100, 2),’%)’)

aic.values.1$NBH <- paste(round(aic.values.1$NBH, 0), ’(’,

round(((aic.values.1$NBH - aic.values.1$PH)/aic.values.1$NBH)*100, 2),’%)’)

aic.values.1$NBHRE <- paste(round(aic.values.1$NBHRE, 0), ’(’,

round(((aic.values.1$NBHRE - aic.values.1$PH)/aic.values.1$NBHRE)*100, 2),’%)’)

return(aic.values.1)

}

ph <- rbind(aic.chg.func(200, 0.50),aic.chg.func(200, 0.40),

aic.chg.func(200, 0.25),aic.chg.func(200, 0.10),

aic.chg.func(1000, 0.50),aic.chg.func(1000, 0.40),

aic.chg.func(1000, 0.25),zero.aic.func(1000, 0.10))

#Step 1: Generate data from Poisson Hurdle-REWLR distribution

aic.chg.func <- function(n, pi, zero.prop) {

# Zero-altered poisson random number generator

rhpois <- function(n=n, mu, zprob){

ifelse(rbinom(n, 1, zprob) == 1, 0, rpois(n, mu))

}

Y <- rhpois(n, mu = 1.3, zprob = pi^w1) #Poisson Hurdle-RE

X <- runif(n, -1, 1) # Independent variable X

48


dsname <- data.frame(Y, X)

#Poisson Regression

model1 <- glm(Y ~ X, family="poisson", data=dsname)

aic1 <- summary(model1)$aic

#Poisson Hurdle Regression

model2 <- hurdle(Y ~ X, data=dsname, dist = "poisson", link="logit")

aic2 <- AIC(model2)

## REWLR-Hurdle Poisson Regression

model3.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = pospoisson(), data=dsname)

model3.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

aic.val3.a <- AICvlm(model3.a)

aic.val3.b <- model3.b$aic

aic3 <- aic.val3.a + aic.val3.b

#Negative Binomial Regression

model4 <- glm.nb(Y ~ X, data=dsname)

aic4 <- summary(model4)$aic

#Negative Binomial Hurdle Regression

model5 <- hurdle(Y ~ X, dist = "negbin")

aic5 <- AIC(model5)

## REWLR-Hurdle Negative Binomial Regression

model6.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = posnegbinomial(), data=dsname)

model6.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

aic.val6.a <- AICvlm(model6.a)

49


aic.val6.b <- model6.b$aic

aic6 <- aic.val6.a + aic.val6.b

#AIC Values based on various zero-proportions

aic.values.1 <- data.frame(ref="Poisson Hurdle - REWLR (PHRE)",

sample.size = n, Zero.Proportion=1-pi, AIC=cbind(aic2, aic3, aic5, aic6))

colnames(aic.values.1) <- c("Reference","Sample size", "Zero Proportion",

"PH", "PHRE", "NBH", "NBHRE")

aic.values.1$PHRE <- round(aic.values.1$PHRE, 0)

aic.values.1$PH <- paste(round(aic.values.1$PH, 0), ’(’,

round(((aic.values.1$PH - aic.values.1$PHRE)/aic.values.1$PH)*100, 2),’%)’)

aic.values.1$NBH <- paste(round(aic.values.1$NBH, 0), ’(’,

round(((aic.values.1$NBH - aic.values.1$PHRE)/aic.values.1$NBH)*100, 2),’)’)

aic.values.1$NBHRE <- paste(round(aic.values.1$NBHRE, 0), ’(’,

round(((aic.values.1$NBHRE - aic.values.1$PHRE)/aic.values.1$NBHRE)*100, 2),’%)’)

return(aic.values.1)

}

phre <- rbind(aic.chg.func(200, 0.50),aic.chg.func(200, 0.40),

aic.chg.func(200, 0.25),aic.chg.func(200, 0.10),

aic.chg.func(1000, 0.50),aic.chg.func(1000, 0.40),

aic.chg.func(1000, 0.25),zero.aic.func(1000, 0.10))

#Step 1: Generate data from NB Hurdle distribution

aic.chg.func <- function(n, pi) {

# Zero-altered negative binomial random number generator

rhnbinom <- function(n=n, mu, size=0.5, zprob){

50


ifelse(rbinom(n, 1, zprob) == 1, 0, rnbinom(n, size = 0.5, mu = mu))

}

Y <- rhnbinom(n, mu = 1.3, size = 3, zprob = pi) #NB Hurdle

X <- runif(n, -1, 1) # Independent variable X

dsname <- data.frame(Y, X)

#Poisson Regression

model1 <- glm(Y ~ X, family="poisson", data=dsname)

aic1 <- summary(model1)$aic

#Poisson Hurdle Regression

model2 <- hurdle(Y ~ X, data=dsname, dist = "poisson", link="logit")

aic2 <- AIC(model2)

## REWLR-Hurdle Poisson Regression

model3.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = pospoisson(), data=dsname)

model3.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

#aic.val3.a <- (-2*logLik.vlm(model3.a))+(2*3)

aic.val3.a <- AICvlm(model3.a)

aic.val3.b <- model3.b$aic

aic3 <- aic.val3.a + aic.val3.b

#Negative Binomial Regression

model4 <- glm.nb(Y ~ X, data=dsname)

aic4 <- summary(model4)$aic

#Negative Binomial Hurdle Regression

model5 <- hurdle(Y ~ X, dist = "negbin")

aic5 <- AIC(model5)

51


## REWLR-Hurdle Negative Binomial Regression

model6.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = posnegbinomial(), data=dsname)

model6.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

#aic.val6.a <- (-2*logLik.vlm(model3.a))+(2*3)

aic.val6.a <- AICvlm(model6.a)

aic.val6.b <- model6.b$aic

aic6 <- aic.val6.a + aic.val6.b

#AIC Values based on various zero-proportions

aic.values.1 <- data.frame(ref="NB Hurdle (NBH)", sample.size = n,

Zero.Proportion=1-pi, AIC=cbind(aic2, aic3, aic5, aic6))

colnames(aic.values.1) <- c("Reference","Sample size", "Zero Proportion",

"PH", "PHRE", "NBH", "NBHRE")

aic.values.1$NBH <- round(aic.values.1$NBH, 0)

aic.values.1$PH <- paste(round(aic.values.1$PH, 0), ’(’,

round(((aic.values.1$PH - aic.values.1$NBH)/aic.values.1$PH)*100, 2),’%)’)

aic.values.1$PHRE <- paste(round(aic.values.1$PHRE, 0), ’(’,

round(((aic.values.1$PHRE - aic.values.1$NBH)/aic.values.1$PHRE)*100, 2),’%)’)

aic.values.1$NBHRE <- paste(round(aic.values.1$NBHRE, 0), ’(’,

round(((aic.values.1$NBHRE - aic.values.1$NBH)/aic.values.1$NBHRE)*100, 2),’%)’)

return(aic.values.1)

}

nbh <- rbind(aic.chg.func(200, 0.50),aic.chg.func(200, 0.40),

aic.chg.func(200, 0.25),aic.chg.func(200, 0.10),

aic.chg.func(1000, 0.50),aic.chg.func(1000, 0.40),

52


aic.chg.func(1000, 0.25),zero.aic.func(1000, 0.10))

#Step 1: Generate data from NB Hurdle REWLR distribution

aic.chg.func <- function(n, pi, zero.prop) {

# Zero-altered negative binomial random number generator

rhnbinom <- function(n=n, mu, size=0.5, zprob){

ifelse(rbinom(n, 1, zprob) == 1, 0, rnbinom(n, size = 0.5, mu = mu))

}

Y <- rhnbinom(n, mu = 1.3, size = 3, zprob = pi^w1) #NB Hurdle-RE

X <- runif(n, -1, 1) # Independent variable X

dsname <- data.frame(Y, X)

#Poisson Regression

model1 <- glm(Y ~ X, family="poisson", data=dsname)

aic1 <- summary(model1)$aic

#Poisson Hurdle Regression

model2 <- hurdle(Y ~ X, data=dsname, dist = "poisson", link="logit")

aic2 <- AIC(model2)

## REWLR-Hurdle Poisson Regression

model3.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = pospoisson(), data=dsname)

model3.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

#aic.val3.a <- (-2*logLik.vlm(model3.a))+(2*3)

aic.val3.a <- AICvlm(model3.a)

aic.val3.b <- model3.b$aic

aic3 <- aic.val3.a + aic.val3.b

53


#Negative Binomial Regression

model4 <- glm.nb(Y ~ X, data=dsname)

aic4 <- summary(model4)$aic

#Negative Binomial Hurdle Regression

model5 <- hurdle(Y ~ X, dist = "negbin")

aic5 <- AIC(model5)

## REWLR-Hurdle Negative Binomial Regression

model6.a <- vglm(Y[Y > 0] ~ X[Y > 0], family = posnegbinomial(), data=dsname)

model6.b <- rewlr(I(Y > 0) ~ X, weights0 = w0, weights1 = w1, data=dsname)

#

#aic.val6.a <- (-2*logLik.vlm(model3.a))+(2*3)

aic.val6.a <- AICvlm(model6.a)

aic.val6.b <- model6.b$aic

aic6 <- aic.val6.a + aic.val6.b

#AIC Values based on various zero-proportions

aic.values.1 <- data.frame(ref="NB Hurdle - REWLR (NBHRE)", sample.size = n,

Zero.Proportion=1-pi, AIC=cbind(aic2, aic3, aic5, aic6))

colnames(aic.values.1) <- c("Reference","Sample size", "Zero Proportion",

"PH", "PHRE", "NBH", "NBHRE")

aic.values.1$NBHRE <- round(aic.values.1$NBHRE, 0)

aic.values.1$PH <- paste(round(aic.values.1$PH, 0), ’(’,

round(((aic.values.1$PH - aic.values.1$NBHRE)/aic.values.1$PH)*100, 2),’%)’)

aic.values.1$NBH <- paste(round(aic.values.1$NBH, 0), ’(’,

round(((aic.values.1$NBH - aic.values.1$NBHRE)/aic.values.1$NBH)*100, 2),’%)’)

aic.values.1$PHRE <- paste(round(aic.values.1$PHRE, 0), ’(’,

54


round(((aic.values.1$PHRE - aic.values.1$NBHRE)/aic.values.1$PHRE)*100, 2),’%)’)

return(aic.values.1)

}

nbhre <- rbind(aic.chg.func(200, 0.50),aic.chg.func(200, 0.40),

aic.chg.func(200, 0.25),aic.chg.func(200, 0.10),

aic.chg.func(1000, 0.50),aic.chg.func(1000, 0.40),

aic.chg.func(1000, 0.25),zero.aic.func(1000, 0.10))

#Combine all

allaic <- rbind(ph, phre, nbh, nbhre)

allaic %>% knitr::kable(format=’latex’) %>%

kable_classic_2(full_width = F, html_font = "Cambria")

A.3 Analysis on Maternal Mortality Data

A.3.1 Exploratory Data Analysis

#Read in Data

maternal <- read.csv("D:/MSc/Thesis/Analysis/Maternal Mortality Data.csv")

maternal1 <- maternal[, -1]

#Rename Columns

maternal2 <- rename(maternal1, MaternalDeaths=Maternal.Deaths,

AssistedDeliveries=assisted.Vaginal.deliveries, BreechDelivery=breach.delivery,

55


CS=caesarian.sections, LiveBirths=live.birth,

EarlyTeenPreg=no.adolesc..10.14.years..pregn.at.1st.anc.visit,

LateTeenPreg=no.adolesc..15.19.years..preg..at.1st.anc.visit,

NormalDeliveries=normal.deliveries,

ANC4Visits=anc.4.visits,

Uterotonics3stg=Number.of.women.giving.birth.who.received.

uterotonics.in.the.third.stage.of.labor..or.immediately.after.birth.,

Carbatosin=Mothers.given.uterotonics.within.1.minute..Carbatosin.,

Oxytocin=Mothers.given.uterotonics.within.1.minute..Oxytocin.,

Eclampsia=Eclampsia, AntHaemorrage=Ante.partum.Haemorrage

PostHaemorrage=Post.Partum.Haemorrage,

ObstructedLabour=Obstructed.Labour,

RupturedUterus=Ruptured.Uterus, Sepsis=Sepsis,

FGMComplicatons=Mothers.with.delivery.complications.

associated.with.FGM, Stillbirth=Macerated.still.Birth)

#EDA

#Central Tendency

mean(maternal2$MaternalDeaths)

std.error(maternal2$MaternalDeaths)

#Spread

sd(maternal2$MaternalDeaths)

data.frame(table(maternal2$MaternalDeaths)) %>%

kbl() %>%

kable_classic_2(full_width = F, html_font = "Cambria")

#Histogram

plot2 <- ggplot(maternal2, aes(x=MaternalDeaths)) +

geom_histogram(binwidth=1, fill="skyblue")+

56


labs(x = "Maternal Deaths", y = "Frequency")+

ylim(0, 150)+

theme_classic()

plot2

#Correlation Analysis

cor(maternal2)

#Averaging the factors

sum1 <- as.data.frame(t(maternal2 %>%

group_by(MaternalDeaths.bin) %>%

summarise_all(mean))) %>%

mutate(across(where(is.numeric), round, 1))

i <- which(str_detect(row.names(sum1), "^Maternal"))

sum2 <- sum1%>% slice(-i)

colnames(sum2) <- c("No Maternal Deaths","Maternal Deaths")

sum2 %>% slice(2:n()) %>%

knitr::kable(format=’latex’) %>%

kable_classic_2(full_width = F, html_font = "Cambria")

A.3.2 Count Models

# Sample data = 293 deaths per 53792 live births;

#Y-bar = 293/53792 = 0.0054

# Population data = 342 deaths per 100000 live births;

Tau = 1600/100000 = 0.016

57


#Source: https://www.health.go.ke/wp-content/uploads/2022/01/

Kenya-SDG-Progress-Report_-April21.pdf

w1 = 0.00163/0.0054

w0 = (1 - 0.00163)/(1 - 0.0054)

#Poisson Hurdle Regression

fit2 <- hurdle(MaternalDeaths ~ Stillbirth+ANC4Visits+LateTeenPreg+

AntHaemorrage+PostHaemorrage+BreechDelivery+Carbatosin+Uterotonics3stg,

dist = "poisson", link="logit", data=maternal2)

f.exp2 <- round(sum(predict(fit2, type = "prob")[,1]),0)

f.aic2 <- AIC(fit2)

summary(fit2)

## REWLR-Hurdle Poisson Regression

fit3.a <- vglm(MaternalDeaths ~ AssistedDeliveries+BreechDelivery

+CS+EarlyTeenPreg, family = pospoisson(), data=Maternal2.gt0)

fit3.b <- rewlr(MaternalDeaths.bin ~ Stillbirth+ANC4Visits+

LateTeenPreg+AntHaemorrage+PostHaemorrage+BreechDelivery+Carbatosin+

Uterotonics3stg, weights0 = w0, weights1 = w1, data=maternal2)

f.exp3 <- round(sum(1-predict.rewlr(fit3.b)))

f.aic.val3.a <- AICvlm(fit3.a)

f.aic.val3.b <- fit3.b$aic

f.aic3 <- f.aic.val3.a + f.aic.val3.b

summary(fit3.a)

summary.rewlr(fit3.b)

#Negative Binomial Hurdle Regression

fit5 <- hurdle(MaternalDeaths ~ Stillbirth+ANC4Visits+LateTeenPreg+

58


AntHaemorrage+PostHaemorrage+BreechDelivery+Carbatosin+

Uterotonics3stg, data=maternal2, dist = "negbin")

f.exp5 <- sum(predict(fit5, type = "prob")[,1])

f.aic5 <- AIC(fit5)

summary(fit5)

## REWLR-Hurdle Negative Binomial Regression

fit6.a <- vglm(MaternalDeaths ~ Stillbirth+ANC4Visits+LateTeenPreg+

AntHaemorrage+PostHaemorrage+BreechDelivery+Carbatosin+

Uterotonics3stg, family = posnegbinomial(), data=Maternal2.gt0)

fit6.b <- rewlr(MaternalDeaths.bin ~ Stillbirth+ANC4Visits+

LateTeenPreg+AntHaemorrage+PostHaemorrage+BreechDelivery+Carbatosin

+Uterotonics3stg, weights0 = w0, weights1 = w1, data=maternal2)

f.exp6 <- round(sum(1-predict.rewlr(fit6.b)))

f.aic.val6.a <- AICvlm(fit6.a)

f.aic.val6.b <- fit6.b$aic

f.aic6 <- f.aic.val6.a + f.aic.val6.b

summary(fit6.a)

summary.rewlr(fit6.b)

################ Area under Curve ################

library(pROC)

#Poisson Hurdle

fit2.lm <- lm(MaternalDeaths ~ Stillbirth+ANC4Visits+LateTeenPreg+AntHaemorrage+PostHaemorrage+BreachDelivery+Carbatosin+Uterotonics3stg, data=maternal2)

auc1 <- roc(maternal2$MaternalDeaths.bin ~ fit2.lm$fitted, plot=TRUE, print.auc=TRUE, main="Poisson Hurdle")

#NB Hurdle

fit5.lm <- lm(MaternalDeaths ~ Stillbirth+ANC4Visits+LateTeenPreg+AntHaemorrage+PostHaemorrage+BreachDelivery+Carbatosin+Uterotonics3stg, data=maternal2)

auc3 <- roc(maternal2$MaternalDeaths.bin ~ fit5.lm$fitted, plot=TRUE, print.auc=TRUE, main="NB Hurdle")

59


#Poisson Hurdle - REWLR

auc2 <- roc(maternal2$MaternalDeaths.bin ~ summary.rewlr(fit3.b)$fitted, plot=TRUE, print.auc=TRUE, main="Poisson Hurdle-REWLR")

#NB Hurdle - REWLR

auc4 <- roc(maternal2$MaternalDeaths.bin ~ summary.rewlr(fit6.b)$fitted, plot=TRUE, print.auc=TRUE, main="NB Hurdle-REWLR")

#Comparison of coefficients between models

#Binary part

coef.bin <- data.frame(PH.BINARY =

summary(fit2)$coefficients$zero[,1],

NBH.BINARY = summary(fit5)$coefficients$zero[,1],

PHRE.BINARY = summary.rewlr(fit3.b)$B,

NBHRE.BINARY = summary.rewlr(fit6.b)$B)

#Count part

coef.cnt <- data.frame(cbind(PH.COUNT =

summary(fit2)$coefficients$count[,1],

NBH.COUNT = summary(fit5)$coefficients$count[,1],

PHRE.COUNT = summary(fit3.a)@coef3[, 1],

NBHRE.COUNT = summary(fit6.a)@coef3[-2, 1]))

coef.cnt$PH.COUNT[10] = "NA"

coef.cnt$NBHRE.COUNT[10] = "NA"

coef.cnt$PHRE.COUNT[c(6,7,8,9,10)] = "NA"

coef.bin %>% knitr::kable(format=’latex’) %>%

kable_classic_2(full_width = F, html_font = "Cambria")

coef.cnt %>% knitr::kable(format=’latex’) %>%

kable_classic_2(full_width = F, html_font = "Cambria")

60


#AIC Values based on various zero-proportions

aic.values <- data.frame(Model=c("Poisson", "PH", "PH-RE","NB",

"NB-H", "NBH-RE"), AIC=rbind(f.aic1, f.aic2, f.aic3,

f.aic4, f.aic5, f.aic6), row.names = NULL)

aic.values %>% knitr::kable(format=’latex’) %>%

kable_classic_2(full_width = F, html_font = "Cambria")

#Zero counts

zero.counts <- data.frame(cbind(observed,f.exp1,f.exp2,f.exp3,

f.exp4,f.exp5,f.exp6))

colnames(zero.counts) <- c("Oberved","Poisson", "PH", "PH-RE","NB",

"NB-H", "NBH-RE")

zero.counts %>% knitr::kable(format=’latex’) %>%

kable_classic_2(full_width = F, html_font = "Cambria")

61


Appendix B

Turnitin Report

62


1/20

Document Information

Analyzed document Sharon Okello Thesis.pdf (D138655736)

Submitted 2022-05-31T13:02:00.0000000

Submitted by

Submitter email Awuor.Okello@strathmore.edu

Similarity 1%

Analysis address library.strath@analysis.urkund.com

Sources included in the report

ST404A3.pdf
Document ST404A3.pdf (D27768399)

1

URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4493133/

Fetched: 2020-04-16T00:11:45.3870000
5

Assignment 3.pdf
Document Assignment 3.pdf (D27768595)

1

assignment3_1414386.pdf
Document assignment3_1414386.pdf (D27768445)

1

Handin 1.pdf
Document Handin 1.pdf (D108014019)

1


2/20

Entire Document

Improving Performance of Hurdle Models using Rare-Event Weighted Logistic Regression: Application to Maternal

Mortality Data Sharon Awuor Okello Submitted in partial fulfilment of the requirements for the Degree of Master of

Science in Statistical Sciences of Strathmore University Institute of Mathematical Sciences Strathmore University Nairobi,

Kenya May 31, 2022 This thesis is available for Library use through open access on the understanding that it is copyright

material and that no quotation from the thesis may be published without proper acknowledgement.

Declaration I declare that this work has not been previously submitted and approved for award of a degree by this or any

other University. To the best of my knowledge and belief, the thesis contains no material previously published or written

by another person except where due reference is made in the proposal itself. © No part of this thesis may be reproduced

without the permission of the author and Strathmore University. Name: .................................................. Sharon Awuor Okello

.............................................. Signature: ............................................................................................ Date: ........................................... May

31, 2022 ....................................................... Approval The thesis of Sharon Awuor Okello was reviewed and approved by the

following: Dr. Collins Ojwang’ Odhiambo Supervisor, Institute of Mathematical Sciences, Strathmore University. Dr. Evans

Otieno Omondi Supervisor, Institute of Mathematical Sciences, Strathmore University. Dr. Godfrey Madigu Dean, Institute

of Mathematical Sciences, Strathmore University. Dr. Bernard Shibwabo Director, Office of Graduate Studies, Strathmore

University. ii

Abstract Hurdle models, which are commonly used alongside zero-inflated models to analyze dis- persed zero-inflated

count data, employ a logit link function to predict whether an observation takes a positive count or a zero count based on

a set of covariates. However, the logit model tends to be biased toward the majority zero class in cases involving rare

events, and may underestimate the positive counts when their proportion is significantly smaller than that of the zero

counts. This research aimed to develop and assess the performance of hurdle models incorporating rare-event weighted

logistic regression and their applications to maternal mortality data. Poisson and Negative Binomial (NB) Hurdle Rare Event

Weighted Logistic Regression (REWLR) model estimates were developed and fit on various simulation conditions and

maternal mortality data for performance evaluation using AIC measures. The Negative Binomial Hurdle REWLR emerged

to be the best performing among all the evaluated models due to the ability to handle dispersion and adjust for class

imbalance. The research findings will provide reliable estimates of the maternal mortality ratio in Nairobi without the risk

of over-fitting zero counts. iii

Table of contents List of figures vii List of tables viii List of abbreviations ix Acknowledgement x Dedication xi 1

Introduction 1 1.1 Background to the study . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Maternal Mortality in Kenya . . . . . . . . . . . . . .

. . . . . . . . . . . 3 1.3 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4 Objective of the study . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . 5 1.4.1 General Objective . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.4.2 Specific Objectives . . . . . . . . . . . . . . . . . . . . . . . . . .

5 1.4.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.5 Justification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6

Significance of the Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Literature review 8 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . 8 2.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.1 Hurdle Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.2 Zero-inflated Models . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.3 Logistic Regression Model . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Our Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3 Methodology 14 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . 14 3.2 Research Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.3 Hurdle-REWLR Model . . . . . . . . . . . . . . . . . . . . . . . . . . .

. 16 3.3.1 Poisson Hurdle-REWLR Model . . . . . . . . . . . . . . . . . . . 17 3.3.2 Negative Binomial Hurdle-REWLR Model . . . . . . . . . .

. . . 18 3.4 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.5 Maternal Mortality data . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20 3.6 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4 Results and Interpretation 22 4.1 Simulation . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . 22 4.2 Application to Maternal Deaths Data . . . . . . . . . . . . . . . . . . . . . 28 4.2.1 Descriptive Statistics .

. . . . . . . . . . . . . . . . . . . . . . . . 28 4.2.2 Maternal Death Models . . . . . . . . . . . . . . . . . . . . . . . . 29 5 Discussion, Conclusion and

Recommendation 33 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . 33 5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.4 Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . 37 5.4.1 Recommendation for further research . . . . . . . . . . . . . . . . 37 5.4.2 Policy recommendation . . . . . . . . . . . . . . . . .

. . . . . . 37 References 38 Appendix A R CODES 41 A.1 Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 v

A.2 Simulations and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 A.3 Analysis on Maternal Mortality Data . . . . . . . . . . . . . . . . . . .

. . 54 A.3.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . 54 A.3.2 Count Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 vi


3/20

List of figures Figure 1.1: MMR Trends between 2000 - 2017: Source Organization et al. (2019) 4 Figure 4.1: AICs from

Models fit on Poisson Hurdle simulated data, n = 200 . . 24 Figure 4.2: AICs from Models fit on Poisson Hurdle simulated

data, n = 1000 . . 25 Figure 4.3: AICs from Models fit on Poisson Hurdle-RE simulated data, n = 200 25 Figure 4.4: AICs

from Models fit on Poisson Hurdle-RE simulated data, n = 1000 26 Figure 4.5: A