ASSESSING THE PERFORMANCE OF SOME RESAMPLING METHODS USING LOGISTIC REGRESSION

This research reports on the performance of two re-sampling methods (Bootstrap and Jackknife) in assessing the relationship and significance of socio-economic factors (age, gender, marital status and settlement) and modes of HIV/AIDS transmission to the spread of HIV/AIDS. A logistic regression model, a form of probabilistic function for a binary response, was used to relate the socio-economic factors (age, sex, marital status and settlement) to HIV/AIDS spread. The statistical predictive model was used to project the likely response of HIV/AIDS spread in a larger population using 10,000 Bootstrap re-sampled observations and Jackknife re-sampled observations. From the analysis obtained from the two re-sampling methods, we can conclude that HIV transmission in Kebbi State is higher among married couples than among single individuals and is concentrated more in the rural areas.


INTRODUCTION
Background to the study
Re-sampling is a method for estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of the available data (jackknifing) or by drawing randomly with replacement from a set of data points (bootstrapping). It is also a method for exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests) and for validating models by using random subsets (bootstrapping, cross validation). Re-sampling did not emerge without context: it is tied to Monte Carlo simulation, in which researchers "make up" data and draw conclusions based on many possible scenarios. Re-sampling methods play a vital role in statistical inference, and their applications have become more popular, especially in the last two decades. Researchers such as Bello et al. (2015a), Bello et al. (2015b), Bobbit et al. (2007) and Wu et al. (1986) applied a class of weighted jackknife variance estimators for the least squares estimator by deleting any fixed number of observations at a time and arrived at a conclusion. Bello et al. applied only the Bootstrap re-sampling method to categorical HIV/AIDS data and considered only age, sex, and employment status, although marital status and rural/urban settlement are also significant factors in HIV/AIDS transmission; Bobbit et al. also used the Bootstrap method, in the International Price Program. In this research, we study two re-sampling methods (Bootstrap and Jackknife) simultaneously on the same phenomenon so as to see the performance of each, and use logistic regression to finally determine the better of the two.
Bootstrap: This technique was invented by Efron (1979, 1981, 1982) and further developed by Efron and Tibshirani (1993). "Bootstrap" means that one available sample gives rise to many others by re-sampling (a concept reminiscent of pulling yourself up by your own bootstraps).
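The two re-sampling schemes can be illustrated with a minimal sketch. It assumes a small made-up sample and uses the sample mean as the statistic; the function names `bootstrap_se` and `jackknife_se` and the data values are hypothetical and are not taken from the study.

```python
import math
import random
import statistics


def bootstrap_se(data, stat, n_boot=2000, seed=0):
    """Bootstrap standard error: resample with replacement, recompute stat."""
    rng = random.Random(seed)
    n = len(data)
    reps = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(reps)


def jackknife_se(data, stat):
    """Delete-one jackknife standard error of stat."""
    n = len(data)
    # Recompute the statistic n times, leaving out one observation each time.
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    return math.sqrt((n - 1) / n * sum((v - mean_loo) ** 2 for v in loo))


sample = [23, 31, 27, 45, 38, 29, 52, 34, 41, 26]  # illustrative values only
se_boot = bootstrap_se(sample, statistics.mean)
se_jack = jackknife_se(sample, statistics.mean)
se_classic = statistics.stdev(sample) / math.sqrt(len(sample))
```

For the sample mean the delete-one jackknife reproduces the classical standard error s/sqrt(n) exactly, while the bootstrap value fluctuates around a slightly smaller number because of Monte Carlo error and the 1/n variance used implicitly by resampling.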
The general variance estimation procedure is based on a substitution idea (Efron, 1983). We have several motivations for using these methods, for example when the usual model assumptions cannot be verified in a given dataset, such as a non-normal distribution with outliers or a mixed distribution with errors. The same applies to asymptotic results when the number of available observations is not enough to guarantee asymptotic convergence. In these cases, methods based on simulation, and more specifically re-sampling, can be useful for establishing the uncertainty in the estimation of the parameters.
The logistic regression model is one of the most popular statistical models for the analysis of binary data, with applications in the physical, biomedical and behavioral sciences, among many others. Logistic regression analysis was implemented to determine the significant contributory factors influencing the subject of study. Cases in which the response variable is categorical, often binary (yes/no, present/absent, etc.), and the possible explanatory variables are categorical, numerical or both, are numerous in biometric, psychometric and epidemiological research.
Empirical Evidence
Bello et al. (2015b), Bobbit et al. (2007) and Wu et al. (1986) applied a class of weighted jackknife variance estimators for the least squares estimator by deleting any fixed number of observations at a time and arrived at a conclusion. Bello et al. applied only the Bootstrap re-sampling method to categorical HIV/AIDS data, and Bobbit et al. also used the Bootstrap method, in the International Price Program. Both used bootstrap re-sampling only, whereas other re-sampling methods can be tested and may prove essential in those respects.
L. M. Raposo et al. (2013) used the logistic regression model to predict resistance to HIV protease inhibitors; the model obtained was said to be useful in decision making regarding the best therapy for HIV-positive individuals.
J. Ren et al. (2014) questioned the suitability of the ordinary, categorical-exposure and logarithmic transformation functions used in the logistic regression model for assessing whether the likelihood of infectious disease is due to risk or is a result of exposure. Using simulated data, the work adjudged the logarithmic transformation function to be better than the other two.
However, they found that the risk of using logistic regression is not a risk at all if a large sample size is used, or if a large-sample procedure such as the bootstrap re-sampling method is used. This reduced the bias in their estimates. "The odd function is the most suitable function for interpretation of binary predictive problems".
Bello et al. (2015) investigated the relationship and significance of socio-economic factors (age, gender, employment status) and modes of HIV/AIDS transmission to the spread of HIV/AIDS in Oyo State, Nigeria. They used a logistic regression model, a form of probabilistic function for a binary response, to relate the socio-economic factors (age, sex, employment status) to HIV/AIDS spread. They also used a statistical predictive model to project the likely response of HIV/AIDS spread in a larger population using 10,000 Bootstrap re-sampled observations. In their findings, age group as a block effect was significantly related to HIV infection, with F-value 6.496 and significance level 0.004 (p<0.05). The age group 16-39 appears to be the most infected age block in the population. This is the reproductive age group and the most sexually active stage of any population, which suggests that any addition to uncontrolled sexual activity and to pregnancy without proper medical support will increase the cases of mother-to-child infection in particular. An individual will not contract HIV because he/she belongs to a particular gender; contraction is mainly a result of activities or exposure. They recommended an increase in the employment level, especially for females in Oyo State, as a vital control measure to mitigate the spread of HIV/AIDS, coupled with increased public awareness, abstinence, and a more comprehensive approach to preventing mother-to-child infection.
The researchers did not consider marital status, age group and rural/urban communities as socio-economic factors, although these are also significant in the modes of transmission of HIV/AIDS. Apart from the Bootstrap, other re-sampling methods can be used to obtain good estimates.

Introduction
Consider the case where our response yi is dichotomous, that is, where the possible response is either yes or no, dead or alive, present or absent; in this work it is either HIV negative or HIV positive.
We coded the presence or absence of the subject of study as 1 and 0 respectively. The distribution of yi is binomial with a single trial, or simply the Bernoulli distribution as it is called in some texts. The binary indicator outcome can only be 1 or 0, and the probability is bounded between 0 and 1; this gives a sigmoidal shape approaching 0 and 1 asymptotically. This is a nonlinear problem.

Logistic Regression:
The logistic function relating yi to predictors, which can be qualitative, quantitative or both, is a very flexible model, which makes it vital for solving many problems in epidemiology and social indicators. The logistic line can also be of the form; π is the probability that the subject of interest in the study occurs and 1-π is the probability that it does not occur. The subject of interest informs our choice of coding 1 or 0. Equation (3.3) gives the probability of yi given that the level of the predictor variable is x1. The logistic regression model is a special case of the generalized linear model. The special problem associated with a model having a binary response variable is that the error terms are not normally distributed and are heteroscedastic, because the distribution of the response variable is bounded between 0 and 1.

Usman et al., FJS
From equation (3.3.1), ε is not normally distributed. Another problem associated with the logistic model is the constraint on the response function. The functional form in equation (3.3) has a left-hand side that takes values between 0 and 1, while the right-hand side is not in a form that returns values between 0 and 1 asymptotically. Therefore, we require a link function to properly link the left-hand side to the right-hand side. A link function such as the identity function is not appropriate for the nonlinear problem at hand. Out of the several possible link functions, we shall use the logit function, for easier understanding and interpretation of our results.
The model is written in the form; The odds can vary on a scale of (0, ∞), so the log odds can vary on a scale of (−∞, ∞), which is precisely what we get from the right-hand side of the linear model. For a real-valued explanatory variable xi, the intuition here is that a unit additive change in the value of the variable should change the odds by a constant multiplicative amount.
Exponentiating, this is equivalent to equation (3.4). The interpretation of the coefficients is not straightforward because, in the logistic regression model, the effect of a unit increase in X on the probability varies according to the starting point of X. The logit function is the natural logarithm (ln) of the odds of y, and taking the exponential of the log odds gives us the odds function, which is vital in the interpretation of our results.
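The logit form and its exponentiated version can be sketched as follows for a single predictor. The symbols \beta_0 (intercept) and \beta_1 (slope) are assumed notation, since the original symbols and equation numbers are not legible here.

```latex
\begin{align*}
\operatorname{logit}(\pi_i) &= \ln\!\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 x_i, \\
\frac{\pi_i}{1-\pi_i} &= e^{\beta_0 + \beta_1 x_i} = e^{\beta_0}\,\bigl(e^{\beta_1}\bigr)^{x_i}.
\end{align*}
```

Under this notation, increasing x_i by one unit adds \beta_1 to the log odds and multiplies the odds by e^{\beta_1}, which is the constant multiplicative change described in the text.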
The odds function will simplify the interpretation problem. The inverse of the logit function is the logistic function. Hence; The logistic function forms in equations (3.7) and (3.8) return a right-hand side that is a probability value ranging between 0 and 1. The function increases monotonically if the gradient θ > 0 and decreases monotonically if θ < 0.
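A short sketch of the logit function and its inverse may help make the bounded, monotone behaviour concrete. The function names and the intercept and slope values below are hypothetical and chosen only for illustration.

```python
import math


def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Inverse of logit: maps any real z to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)


def predicted_probability(x, intercept, slope):
    """Probability implied by a linear predictor intercept + slope * x."""
    return logistic(intercept + slope * x)


xs = [0, 10, 20, 30, 40]
rising = [predicted_probability(x, -2.0, 0.1) for x in xs]   # slope > 0
falling = [predicted_probability(x, 2.0, -0.1) for x in xs]  # slope < 0
```

With a positive slope the predicted probabilities rise with x, with a negative slope they fall, and in both cases they stay strictly between 0 and 1.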

Method of Estimation
The variance of the error terms differs at different levels of X, as shown in equation (3.3.1). This makes ordinary least squares ineffective for estimating the logistic function. Maximum likelihood is a better method: because the logistic function predicts probabilities, and not just classes, it can fit the probability for each class of our data points, either 'πi' or '1-πi'. We must also note that the error term is not usually considered in logistic problems.
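The non-constant error variance can be worked out in a few lines. This sketch assumes that equation (3.3.1) has the usual form y_i = \pi_i + \varepsilon_i for a Bernoulli response; that form is an assumption, since the equation itself is not shown here.

```latex
\begin{align*}
y_i &= \pi_i + \varepsilon_i, \qquad
\varepsilon_i =
\begin{cases}
1 - \pi_i & \text{with probability } \pi_i, \\
-\pi_i & \text{with probability } 1 - \pi_i,
\end{cases} \\
E(\varepsilon_i) &= (1-\pi_i)\,\pi_i - \pi_i\,(1-\pi_i) = 0, \\
\operatorname{Var}(\varepsilon_i) &= (1-\pi_i)^2\,\pi_i + \pi_i^2\,(1-\pi_i) = \pi_i(1-\pi_i).
\end{align*}
```

Because \pi_i depends on the predictors, the variance \pi_i(1-\pi_i) changes from one level of X to another, and the error takes only two values, so it cannot be normal.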

Maximum Likelihood Estimation
The maximum likelihood estimate is the value of the parameter that makes the observed data most likely (3.12). The estimates are the values of the parameters that maximize the log-likelihood, that is, the values that assign the highest possible probability to the sample that was actually obtained. Maximum likelihood estimation of a logistic function usually requires numerical procedures, and Fisher scoring or Newton-Raphson often works best. Most statistical packages have a numerical search procedure for the logit model. In this work, an R package for obtaining the maximum likelihood estimates of a logistic regression is used.
Recall equation (3.7); π is substituted into the likelihood function to obtain the log-likelihood. Differentiating the log-likelihood function in equation (3.13) with respect to each parameter and setting each of the k resulting equations equal to zero does not give the maximum likelihood estimates analytically, because this is a system of k nonlinear equations. The solution for the k unknowns cannot be obtained analytically, only through numerical estimation using an iterative process. The Newton-Raphson method is popularly used for the nonlinear logistic function. However, a multicollinearity problem may arise, which is visible when there are large estimated parameters and large standard errors. Convergence problems in the numerical search procedure can also be associated with multicollinearity, which can be overcome by reducing the number of predictor variables for easy and quick convergence.
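The iterative search can be sketched for the simplest case of one predictor. This is a minimal Newton-Raphson illustration under stated assumptions, not the R routine used in the study: the function `fit_logistic`, the parameter names `b0` and `b1`, and the ten data points are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson for logit(pi) = b0 + b1 * x; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian) with weights p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system H * step = g by Cramer's rule.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1


# Made-up data for illustration only: age and a 0/1 outcome.
ages = [18, 22, 25, 29, 33, 37, 41, 46, 52, 60]
outcome = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
b0, b1 = fit_logistic(ages, outcome)
```

At convergence both score equations are numerically zero, which is the condition the text describes as setting the derivatives of the log-likelihood to zero. When predictors are nearly collinear the determinant `det` approaches zero and the steps become unstable, which mirrors the convergence problem mentioned above.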

Variance Estimation of a Logistic Function Using the 4 Re-sampling Methods
The generalized linear model relies on asymptotic approximations in estimating the coefficient standard errors, and this may not be reliable, just as measures such as R-square-based statistics and residual errors are not very informative and can be misleading. Therefore, using these re-sampling techniques (Bootstrap, Jackknife, Randomization exact test and Cross validation) will either confirm or dispel our doubts about the sufficiency of our sample for obtaining unbiased and robust estimates of the population parameters. For our models to capture the reality of HIV/AIDS spread across different socio-economic classes in the Kebbi State population as closely as possible, we shall generate 10,000 re-samples from the original sample to estimate our models' parameter values and their confidence intervals.
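The re-sampling of a fitted logistic coefficient can be sketched as below. The sketch assumes synthetic data with a single standardised covariate and a known slope, uses 300 bootstrap replicates rather than 10,000 to keep the run short, and introduces the hypothetical helper `fit_slope`; none of these choices come from the study.

```python
import math
import random
import statistics


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_slope(x, y, max_iter=25, tol=1e-8):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x; returns the slope b1."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            break
    return b1


rng = random.Random(1)
# Synthetic data (assumption): covariate on [-2, 2], true slope 1.0.
x = [rng.uniform(-2.0, 2.0) for _ in range(200)]
y = [1 if rng.random() < sigmoid(-0.5 + 1.0 * xi) else 0 for xi in x]
n = len(x)
slope_hat = fit_slope(x, y)

# Bootstrap: draw n cases with replacement, refit, collect the slopes.
boot = []
for _ in range(300):
    idx = [rng.randrange(n) for _ in range(n)]
    boot.append(fit_slope([x[i] for i in idx], [y[i] for i in idx]))
boot.sort()
se_boot = statistics.stdev(boot)
# Approximate 95% percentile interval from the sorted replicates.
ci_boot = (boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot)) - 1])

# Jackknife: delete one case at a time, refit, collect the slopes.
jack = [fit_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)]
jack_mean = sum(jack) / n
se_jack = math.sqrt((n - 1) / n * sum((b - jack_mean) ** 2 for b in jack))
```

If the original sample is adequate, the two standard errors should be of similar size and the percentile interval should bracket the original estimate, which is the kind of agreement the text looks for.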

The Odd Function
Bland and Douglas (2000) mentioned that there are mainly three reasons to use the odds ratio. "Firstly, they provide an estimate (with confidence interval) for the relationship between two binary variables. Secondly, they enable us to examine the effects of other variables on that relationship, using logistic regression. Thirdly, they have a special and very convenient interpretation." The odds are nonnegative, with odds greater than 1.0 when a success is more likely than a failure. According to Pedhazur (1997), odds are determined from probabilities and range between 0 and infinity. Odds are defined as the ratio of the probability of success to the probability of failure. The odds of success are given by π/(1-π) and the odds of failure by (1-π)/π. The odds of success and the odds of failure are just reciprocals of one another. Probability and odds both measure how likely it is that our subject of interest will occur. While the sign of the log-odds ratio indicates the direction of the relationship, for odds ratios the direction is given by which side of 1 the value falls on. An odds ratio of 1 indicates no relationship, less than one indicates a negative relationship, and greater than one indicates a positive relationship. To get an intuitive sense of how much things are changing, we need to take the exponential of the log-odds ratio, which gives us the odds ratio itself.
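These definitions can be checked with a few lines of arithmetic. The probabilities below are arbitrary illustrative values and the helper names are hypothetical.

```python
import math


def odds(p):
    """Odds of success for a success probability p in [0, 1)."""
    return p / (1.0 - p)


def odds_ratio(p1, p0):
    """Ratio of the odds in group 1 to the odds in group 0."""
    return odds(p1) / odds(p0)


p = 0.2                       # illustrative probability of success
odds_success = odds(p)        # about 0.25
odds_failure = odds(1.0 - p)  # about 4, the reciprocal of odds_success

# Direction of a relationship read from the odds ratio and its logarithm.
or_positive = odds_ratio(0.6, 0.4)       # greater than 1
or_negative = odds_ratio(0.3, 0.5)       # less than 1
log_or_negative = math.log(or_negative)  # negative sign, same direction
```

Exponentiating `log_or_negative` returns `or_negative`, which is the step from a logistic regression coefficient back to an odds ratio.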
The odds ratio is the ratio of the odds for x=1 to the odds for x=0.
The residuals plotted against the predicted probability (see Figure 2) show that the lowess smooth approximates a line with zero slope and intercept, and we can conclude that model inadequacy is not apparent. The parameter estimates from the original observations and the 1,000 bootstrap samples were asymptotically the same, and both confidence intervals nearly coincide; this demonstrates the precision of the model coefficient estimates. Thus, we can conclude with approximately 95 percent confidence that our sample size is as sufficient as any larger sample size (see Table 1). Statistically, we can say that our sample is a good representation of the entire population from which it was drawn.
The negative coefficient of the sex parameter suggests a negative relationship between sex and HIV infection, and it shows strong statistical significance at the 0.05 level of significance with a standard error of 0.318448. However, an individual will not contract HIV because he/she belongs to a particular sex; contraction is mainly a result of activities or exposure. Also from Table 1, the difference in the odds ratio of HIV infection between married male individuals in the urban population and single male individuals in the urban population is 0.1767087. This result implies that the odds of married male individuals in the urban population contracting HIV are 18% higher than those of single male individuals in the urban population at a given age. The difference in the odds ratio of HIV infection between married male individuals in the rural population and their single counterparts is 0.1695194. This result implies that the odds of married male individuals in the rural population contracting HIV are 17% higher than those of single male individuals at a given age.
The odds of HIV infection for married female individuals in the urban population are 54% higher than for unmarried female individuals in the same population. Likewise, the odds of HIV infection for married female individuals in the rural population are 52% higher than for their unmarried counterparts.

PREDICTIVE MODELS
We can now predict the likelihood of HIV spread in Kebbi State among the different sexes, marital statuses and settlements across all possible ages. The negative coefficient of the Age parameter suggests a negative relationship between age and HIV infection, which implies that the probability of contracting HIV decreases as the age of a person increases. The odds are best described by exp(c*Age), where c is the difference, in units of age, between the ages under comparison. For the difference between ages 25 and 35 years, the odds of contracting HIV are exp(10*(0.006179)) = 0.03095. This indicates a negative relationship between age and HIV infection, and we can say that the odds of contracting HIV decrease by 3% with each additional 10 years of age. The inverse relationship between age and the probability of HIV infection suggests that the younger generation, below the age of 35 years, should be the first priority in all agendas for eradicating the spread of HIV/AIDS. From Table 2, the difference in the odds ratio of HIV infection between married male individuals in the urban population and single male individuals in the urban population is 0.259615. This result implies that the odds of married male individuals in the urban population contracting HIV are 26% higher than those of single male individuals in the urban population at a given age. The difference in the odds ratio of HIV infection between married male individuals in the rural population and their single counterparts is 0.2631593. This result implies that the odds of married male individuals in the rural population contracting HIV are 26% higher than those of single male individuals at a given age. The odds of HIV infection for married female individuals in the urban population are 20% higher than for unmarried female individuals in the same population.
Likewise, the odds of HIV infection for married female individuals in the rural population are 21% higher than for their unmarried counterparts.
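The way a logistic coefficient translates into a percentage change in the odds over c units can be sketched as follows. The coefficient value used here is hypothetical and is not one of the fitted values reported in the tables; the function names are also assumptions.

```python
import math


def odds_multiplier(coef, c):
    """Multiplicative change in the odds for a c-unit change in the predictor."""
    return math.exp(c * coef)


def percent_change_in_odds(coef, c):
    """Percentage change in the odds for a c-unit change in the predictor."""
    return 100.0 * (odds_multiplier(coef, c) - 1.0)


age_coef = -0.01  # hypothetical age coefficient, for illustration only
multiplier_10y = odds_multiplier(age_coef, 10)     # exp(-0.1), about 0.905
change_10y = percent_change_in_odds(age_coef, 10)  # about -9.5 percent
```

A negative coefficient gives a multiplier below 1 and therefore a percentage decrease, and the c-unit multiplier equals the one-unit multiplier raised to the power c.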

PREDICTIVE MODEL
We can now predict the likelihood of HIV spread in Kebbi State among the different sexes, marital statuses and settlements across all possible ages. 3. The predicted model for married male individuals in the urban population.
