TOWARDS PREDICTION OF CEREBROSPINAL MENINGITIS DISEASE OCCURRENCE USING LOGISTICS REGRESSION – A WEB BASED APPLICATION

Cerebrospinal meningitis (CSM) is characterized by acute severe infection of the central nervous system causing inflammation of the meninges, with associated morbidity and mortality. Information about its symptoms, time and season of spread, most affected region, fatality rate, type, and how easily it causes major disabilities in patients can be modelled and used in its treatment and prevention. This research uses data mining techniques to predict the occurrence of CSM, in terms of those liable to be infected by the disease, using feature information about the region and the patient. It encompasses data collection, preprocessing, exploration, algorithm training, prediction, and web hosting. The intention is to help manage the resources needed for both treatment and prevention. The outcome of the research indicates that the proposed technique is viable for the task, considering the number of correct predictions reported when the application was deployed and tested.


INTRODUCTION
Cerebrospinal meningitis (CSM) is the inflammation of the meninges (dura mater, arachnoid mater, and pia mater) covering the brain and the spinal cord. It is an important cause of morbidity and mortality in many regions of the world (Swathy and Saruladha, 2021). It is mainly caused by infectious agents such as viruses, bacteria, and fungi. In rare cases, meningitis can be caused by certain medications. Causative organisms include Neisseria meningitidis (meningococcus), Streptococcus pneumoniae (pneumococcus), Haemophilus influenzae, Streptococcus agalactiae (group B streptococcus), Mycobacterium tuberculosis, Salmonella, Listeria, Streptococcus, Staphylococcus, and so on (World Health Organization, 2021). The signs and symptoms of CSM include headache, fever, vomiting, seizures, reduced consciousness, fatigue, nuchal rigidity, and positive Kernig's and Brudzinski's signs. The disease can affect children, young adults, and old people. However, children are at particular risk of developing serious complications (Zaccari and Marujo, 2019). Cerebrospinal meningitis is transmitted through nose and throat secretions, for example during sneezing and coughing.
Although there is an epidemiological transition from communicable to non-communicable diseases such as stroke, hypertension, diabetes mellitus, arthritis, and sickle cell disease the world over, infectious diseases remain a burden in some developing countries (Liu et al., 2021). For instance, the Federal Ministry of Health Nigeria reported in 2017 an outbreak of cerebrospinal meningitis involving 16 states of the Federation, with 2524 cases and 328 deaths recorded at the time of the report. In previous years there were also many reports of outbreaks that claimed many lives. This is not surprising, as the northern region of the country lies in the Meningitis Belt (Wu et al., 2020). According to Bello et al. (2020), climate change, which is reflected in rainfall and atmospheric temperature, has a significant effect on weather conditions and on the adaptability of humans to diseases. The meningitis belt of sub-Saharan Africa, which stretches from Senegal to Ethiopia, experiences the highest disease burden, with more than 400 million people in 26 countries (Patel et al., 2022).
Diagnosis of cerebrospinal meningitis is mostly carried out in the laboratory, which requires capital-intensive, state-of-the-art equipment and highly skilled personnel (Seho et al., 2022). Considering these requirements and the status of most countries in the meningitis belt as third-world countries, they are not easily met in terms of capital and human resources. This contributes to the high fatality rate of the disease and to shortfalls in its prevention and treatment. As stated in Mentis et al. (2021), mitigating meningitis remains both a global health challenge and a clinical emergency, the latter even in wealthy settings and developed nations.
The WHO established the global strategy "Defeating Meningitis by 2030" with the assistance of other partners. In 2020, the World Health Assembly approved the strategy in the first meningitis resolution ever, and all WHO member states unanimously supported it. The roadmap sets a path to achieve its goals through concerted actions across five interconnected pillars: prevention and epidemic control, focused on the development of new affordable vaccines, achievement of high immunization coverage, and improvement of prevention strategies and response to epidemics; diagnosis and treatment, focused on speedy confirmation of meningitis and optimal patient care; disease monitoring to inform meningitis prevention and management; meningitis care and support, with an emphasis on early detection and better access to care and support for complications from meningitis; and advocacy and engagement, to ensure high awareness of meningitis, to promote country engagement, and to affirm the right to prevention, care, and after-care services.
Although meningitis can be diagnosed from the outcome of a series of laboratory tests, weather conditions, season, and region of residence, among other factors, can be used to predict the occurrence of the disease. This paves the way for automation in which health workers are enabled to make informed decisions based on the available data with the aid of machine learning algorithms. The method involves the following stages: data acquisition, preprocessing, exploration, algorithm training, evaluation, and prediction. The outcome of this procedure is a model that can be embedded into a web application for use by health workers and patients. A higher percentage of people now have access to IT gadgets and the internet, which also encourages the spread of information. Hence, this research creates a web application that predicts the occurrence of cerebrospinal meningitis in certain areas of Northern Nigeria, which falls within the meningitis belt.
According to the NCDC, cerebrospinal meningitis is an epidemic-prone disease with cases reported all year round in Nigeria. The "Meningitis Belt" of Africa, which is located south of the Sahara Desert, has the largest burden. In Nigeria, the belt includes all 19 northern states, the Federal Capital Territory (FCT), and some southern states. This makes CSM a disease that needs to be controlled, through prevention and treatment, in order to increase life expectancy. A news release published in the Nigerian PUNCH Newspaper on Wednesday, January 11, 2023, and signed by Dr. Ifedayo Adetifa, Director-General of the NCDC, stated: "According to the Nigeria Center for Disease Control, 159 Local Government Areas in 32 States, including the Federal Capital Territory, have reported 961 probable cases of CSM thus far in 2022". CSM remains a priority disease and an ever-present public health threat in several states of the country.
Machine learning is set to transform healthcare by providing innovative solutions including prediction, detection, diagnosis, and cure. The need for continual research toward better performance in this domain cannot be overemphasized. Mentis et al. (2021) carried out a differential diagnosis of meningitis based on machine learning algorithms using different combinations of covariates. Three machine learning algorithms, namely Multivariate Logistic Regression (MLR), Random Forest (RF), and Naïve Bayes (NB), were applied to predict the type of meningitis for patients in the age groups 0-14 years and greater than 14 years. The performance of the ML algorithms was evaluated through a cross-validation procedure, and optimal predictions of the type of meningitis were above 95% for viral and 78% for bacterial meningitis. Overall, MLR and RF yielded the best performance.
In Seho et al. (2022), an expert system based on a feed-forward Artificial Neural Network (ANN) was developed for classifying meningitis. Two learning algorithms were employed for this purpose. The first was the Levenberg-Marquardt training algorithm, which is described as most suitable for pattern recognition. The second was particle swarm optimization (PSO), whose purpose is to adjust the decision threshold so that the ANN is optimized for the prediction task at hand. A database containing temperature, glucose ratio, proteins, cerebrospinal fluid (CSF) leukocytes, glucose, lactates, and erythrocyte sedimentation rate (ESR) was used. The results showed that the ANN performed well with 15 neurons in the hidden layers, giving an accuracy of 96.67%, alongside the other evaluation metrics reported. The accuracy of results and the coverage of research in this domain need to be explored further, given the limited number of studies available.

Research approach
This research employs exploratory data analysis. It acquires data from Kaggle, a well-known public dataset repository, and preprocesses the data using Python packages that include Pandas, Matplotlib, Seaborn, and scikit-learn. It then builds a logistic regression model, which is hosted as a web application on a web server. The process is shown in Figure 1.

Dataset description
The dataset used is the disease outbreak in Nigeria (DON) dataset, which comprises 40 columns (features) and 284484 rows (instances). The outbreak data for northern Nigeria was culled from this dataset and is made up of 40 columns and 130993 rows (Emmanuel, 2019). The features in the data include settlement, state, report date, gender, age, and so on, as shown in Table 1, some of which are used for predicting the possibility of the disease for a patient. The dataset has 27 numerical columns and 13 categorical columns. There are 39 features (input variables) and a label (output variable), which is the meningitis column. The presented dataset contains 30993 samples distributed in two categories: (1) not infected subjects and (2) infected subjects. The output is categorical, either 0 or 1. The label variable indicates whether the patient has meningitis or not, with 12999 positive cases against 17994 negative cases.
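As a quick arithmetic check on the class counts quoted above, the following sketch computes the share of each class and the majority-to-minority ratio. Only the two counts come from the text; the variable names are illustrative.

```python
# Class counts as reported in the dataset description.
positive_cases = 12999  # infected (label 1)
negative_cases = 17994  # not infected (label 0)

total_cases = positive_cases + negative_cases
positive_share = positive_cases / total_cases
negative_share = negative_cases / total_cases
# Ratio of majority to minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = negative_cases / positive_cases

print(f"total samples: {total_cases}")
print(f"positive share: {positive_share:.3f}")
print(f"negative share: {negative_share:.3f}")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

The two counts sum to the 30993 samples stated for the dataset, and the ratio gives a rough sense of how far the label is from being balanced.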

Research Experiment
Adopted Tools and Libraries for Exploratory Analysis and Visualization
NumPy and Pandas are two of the libraries used to read data into the Python notebook. NumPy works well for numerical computing, particularly with multidimensional arrays, which makes it easier for the training algorithm to process and analyze the data. Data is read into the notebook and formatted using the Pandas library, which provides a data frame format that eases further processing of the data.
The exploratory data analysis and visualization tools used in this study are Pandas, Matplotlib, and Seaborn. Matplotlib and Seaborn handle plotting and displaying graphs, while Pandas handles operations such as grouping, obtaining value counts, and obtaining summary statistics.

Data preprocessing
Preprocessing was carried out using Pandas, scikit-learn, and NumPy. Pandas was used for handling missing values, whereas scikit-learn was used for splitting and encoding the data. NumPy was used for feature engineering and outlier treatment.
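To illustrate the encoding step, the sketch below turns a categorical column into 0/1 indicator columns in plain Python. It assumes a one-hot scheme, which the text does not specify, and it is a stand-in for the scikit-learn encoder rather than the code used in the study. The helper name `one_hot_encode` and the sample values are hypothetical.

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column matching this value
        rows.append(row)
    return categories, rows


# Hypothetical settlement values for four patients.
settlement = ["rural", "urban", "rural", "urban"]
categories, encoded = one_hot_encode(settlement)
print(categories)
print(encoded)
```

Each category becomes its own column, so the model receives numeric inputs without an artificial ordering being imposed on the categories.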
Checks for missing data, duplicates, outliers, and label imbalance were performed. All returned negative results except the check for label imbalance, which was corrected using SMOTE from the Imblearn package. The Synthetic Minority Over-sampling Technique (SMOTE) creates synthetic samples from the minority class instead of creating copies. Feature scaling was carried out on the Age feature. Furthermore, encoding was used to create new features, which were added to the dataset's existing features after the suitable ones had been chosen for training the algorithm. Some of the preprocessing procedures are shown in Figures 2a, 2b, and 2c.
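The idea behind SMOTE can be sketched in a few lines: a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The plain-Python version below is a simplified illustration under that assumption, not the Imblearn implementation. The function name `smote_sketch`, its parameters, and the sample points are hypothetical.

```python
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(base, neighbour)]
        )
    return synthetic


# Hypothetical minority-class samples with two numeric features each.
minority_samples = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
new_samples = smote_sketch(minority_samples, n_synthetic=3)
print(new_samples)
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region already occupied by the minority class rather than duplicating rows.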

Building Web Application
The web application was built with the Streamlit Python framework in the Visual Studio Code (VS Code) IDE, where the package is executed as a Python script. The application was built in a number of steps: developing widgets for data input, coding the backend that generates predictions, and developing widgets for the output of predictions. The input widget takes either a single instance or multiple instances of query data for prediction. The application is written in Python using the widgets that Streamlit provides.

Hosting on Cloud Platform
The web application was hosted on the Heroku cloud application platform. A Heroku account and a Heroku application are the two prerequisites for this step.

RESULT AND DISCUSSION
The dataset, a comma-separated values (CSV) file, was loaded into the Jupyter notebook using the Pandas Python library. The results of the experiment carried out through the stages described above are presented in this section.

Result of Exploratory Data Analysis and Visualization:
The dataset was examined to identify correlations between columns and trends within and between features. Both categorical and numerical features were examined using univariate, bivariate, and multivariate exploratory analysis and visualization techniques. These include Pandas value counts as a non-graphical univariate technique for categorical and numerical features. A sample of the visualizations produced from the dataset is shown in Figure 5. This result shows that the northern states are all at risk of the infection.

Result of Input, Backend Processing and Output of Model Prediction
After the input and backend processing, the output of the prediction is presented as the patient either having meningitis or not. For probability values below 0.5, the test patient is classified as not having meningitis and a value of 0 is recorded in the prediction column, while values of 0.5 and above indicate that the patient has meningitis and a value of 1 is recorded. The prediction column has two classes, "infected" and "not infected", and the probability column provides the likelihood that the patient has contracted the infection. The probability value is generated by the logistic regression model from the input features. Sample output predictions are shown in Figure 6.
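The decision rule described above can be sketched as follows: a logistic regression model passes a weighted sum of the input features through the logistic (sigmoid) function to obtain a probability, and the 0.5 cut-off converts that probability into a class label. The weights, bias, and feature values below are hypothetical and chosen only for illustration; they are not the fitted coefficients of the study's model.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one patient's numeric features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    # Probabilities of 0.5 and above are labelled 1, otherwise 0.
    label = 1 if probability >= threshold else 0
    return probability, label


# Hypothetical coefficients and encoded inputs, for illustration only.
weights = [0.8, -0.5]
bias = -0.1
for features in ([1.0, 0.0], [0.0, 1.0]):
    probability, label = predict(features, weights, bias)
    status = "infected" if label == 1 else "not infected"
    print(f"probability={probability:.3f} -> {status}")
```

Keeping the probability alongside the label, as the application does, lets a health worker see how close a case is to the cut-off instead of relying on the binary outcome alone.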

Hosting the Web Application on Heroku Cloud Application Platform
The model was hosted on Heroku, a free and widely used cloud application platform with premium options for hosting more than five active web applications. The Heroku client was downloaded, and an account was created and used to access the web application. The command line interface steps for hosting the web application on the cloud server are shown in Figure 7 and Figure 8, together with the results from the hosted web application.
From the simulation experiment conducted in this research, some input features were found to have a notable influence on the predictions. These features are settlement, age, geopolitical zone, and gender. They affect predictions to varying degrees, with settlement and age having the greatest effect. This is evident in that most patients predicted to be infected with meningitis are rural dwellers, while children living in urban settlements are not usually predicted to be infected. Patients from the northeastern part of the country have a higher probability of being infected with meningitis, and the probability range from the experiment conducted is between 0.48 and 0.53.
Figure 9 and Figure 10 are screenshots of the simulation, showing the 50 instances provided as input and the corresponding multiple-prediction output, respectively. Some instances from the multiple predictions include: Danbazu Mero, a 65-year-old female who lives in the northcentral part of the country and is predicted as not infected; Ibrahim James, a 1-year-old boy who resides in an urban area in the northeastern part of the country and is predicted as not infected; and Potiskum Bola, an 8-year-old girl who resides in a rural area in the northwestern part of the country and is predicted as not infected. These show the varying levels of influence of different features on the prediction.

CONCLUSION
It can be observed from the experiment that the input variables behave differently depending on how they are combined. Feature combinations involving urban dwellers show a higher tendency of not being infected compared to rural dwellers; in the real world, this can be attributed to the improved health care facilities available in urban areas. It is recommended that real-world health-related datasets should be made available and accessible for the purpose of conducting research, especially those of

Figure 1: Process Flow of the Proposed Model

Figure 2a: Feature Encoding

Figure 3b: Classification Report.

Additionally, using the joblib Python module, the logistic regression model was saved and then hosted as a web application on the cloud platform. The code snippet for storing the model is shown in Figure 4.

Figure 6: sample instance of prediction Data Frame.

Figure 9: Multiple Data Input.
Figure 10: Multiple Data Results.


Table 1: Some of the features of the dataset and their variable definitions
Feature | Variable definition | Example value
Age | The age of the patient in years. | Ranges from 0 to 78 years with a median value of 34 years.
Gender | The gender of the patient is either male or female. | 67782 Female class against 63211 Male class.
States | State of the patient from the northern states. | 18 states of the federation including the Federal Capital Territory (FCT).