Dadris: The Regional Crime Puzzle

Overview

People often say poverty and unemployment are the roots of many systemic problems in the Philippines.

While poverty and unemployment are usually framed as economic concerns, their reach extends further, sometimes even shaping the safety and security of Filipinos.

Across the country, crime remains a pressing issue that affects both rural and urban areas. In fact, the Philippines ranks 25th worldwide in criminality [1]. But what factors play a major part in this alarming issue? Does the prevalence of shabu, guns, and neglected children alone define our country’s crime rate [2]? And does every region contribute to this “national crime rate” equally, or do some regions record relatively more crime than others?

In this study, we approach these questions by examining the regional crime puzzle, as the title implies. We aim to dissect how poverty and employment levels vary across regions, and how these socioeconomic factors may help explain crime trends. While not implying direct causation, our goal is to ultimately uncover correlations and patterns that shed light on the extent to which improving economic opportunities could influence public safety.

Background

Poverty and unemployment continue to define the socioeconomic landscape of the Philippines.

According to official estimates, the national poverty rate declined to 15.5% in 2023, down from 18.1% in 2021, representing about 17.54 million Filipinos living below the basic needs threshold [3].

At the same time, crime remains a critical public concern. In the 2023 Global Organized Crime Index, the Philippines ranked 25th out of 193 countries in terms of criminality, indicating relatively high exposure to organized crime risks [4].

Moreover, crime is not distributed uniformly across the archipelago. Recent regional-level studies of index crimes (such as theft, assault, property crimes) have shown that incidence and types of crime differ significantly across administrative regions, especially during and after the COVID-19 pandemic [5].

These observations point to pivotal questions: Why do some regions experience more crime despite similar economic conditions? How well do poverty and employment explain these spatial differences? By focusing on regional-level analysis from 2018 to 2023 (with 2025 for validation), this study aims to clarify how variations in socioeconomic status align with crime outcomes and whether improving economic conditions could help foster safer communities.

Problem

Employment and poverty are often used solely to gauge the economic state of the country, but in reality they are closely tied to the crimes that happen across different regions in the Philippines.

Without a clear grasp of how employment and poverty affect crime, we risk overlooking crucial factors that may aid in resolving the age-old issue of crime in the country.

Solution

Recognizing this issue, what do we do?

We leverage data science and machine learning to analyze statistics on employment, poverty, and crime. By dissecting annual, regional data from 2018 up to the most recent available year, we hope to surface correlations that can spark deeper conversations about crime and the root causes that shape the lived realities of Filipinos today.

Research Questions

To move beyond intuitions and surface real patterns, we framed the following research questions:

1. What is the correlation between poverty and employment with crime counts/rates across regions in the Philippines?
2. How do regional variations in these socioeconomic and institutional factors (i.e., poverty and employment) influence differences in crime counts/rates?
3. To what extent can improvements in these factors (e.g., reduced poverty, increased employment) predict a decrease in crime rates across regions?

Hypotheses

To anchor our investigation, we hypothesized:

Null Hypotheses

H_0,1: Poverty levels have no significant effect on crime counts/rates across Philippine regions.
H_0,2: Employment levels have no significant effect on crime counts/rates across Philippine regions.

Alternative Hypotheses

H_A,1: Poverty levels have a significant effect on crime counts/rates across Philippine regions.
H_A,2: Employment levels have a significant effect on crime counts/rates across Philippine regions.

Action Plan

Finally, we mapped out a plan:

We collect annual crime data by region, along with employment and poverty data by region, to explore trends and assess correlations between crime rates and both employment and poverty statistics.

Data Collection

The following datasets form the foundation of our regional analysis.

Employment

The employment data from 2018 to 2022 were retrieved from the Philippine Statistics Authority (PSA), specifically from Table 1, Chapter 11 of the 2024 PSA Yearbook. These include the estimated population per region, the labor force participation rate, and the employment and unemployment rates. The 2023 data were also retrieved from the PSA, but from its 2023 Annual Provincial Labor Market Statistics (Preliminary Results), where the labor force participation, employment, and unemployment rates come from Tables 1, 2, and 3, respectively. Note that the population figures cover people aged 15 and over, and that they are estimates expressed in thousands. The source tables also report supporting statistics such as the coefficient of variation (CV), standard error, and 95% confidence intervals; these were not included in our dataset. Lastly, the employment data is the most complete among the datasets, with no missing years from 2018 to 2023.

Combination 2013-2022

Labor Force 2022-2023

Employment 2022-2023

Unemployment 2022-2023

Philippine Statistical Yearbook (2024)

2023 Annual Provincial Labor Market Statistics

Poverty

Regional poverty statistics were sourced from the Philippine Statistics Authority (PSA) database, covering the years 2018, 2021, and 2023. Out of 432 regional samples (18 regions for each specific poverty field category across 24 collated tables, hence 18 * 24 = 432), we used only Table 5a and Table 6a of the dataset (totalling 18 * 3 = 54 regional samples), since they focus on the income-based magnitude of poor individuals residing in urban and rural areas, respectively. Intuitively, the poor individuals in urban and rural areas together account for the total number of poor individuals in each region. Likewise, income is generally treated as the primary basis for measuring poverty, which is why these are the relevant tables for the study.

Note: Regional poverty statistics are released only for selected years rather than annually, which explains the gaps between the years considered. For this reason, the study considers three (3) methods to interpret the data:

1. Direct Analysis (Sparse Years) - Correlation and regression analysis using only the available years (2018, 2021, 2023).
2. Tolerance of Gaps - Treating missing years as gaps without interpolation (acknowledging limited temporal resolution).
3. Imputation & Time Series Modeling - Estimating values for gap years (2019, 2020, 2022) using statistical interpolation and time series forecasting techniques.

Poverty (2018, 2021, 2023)

PSA on Poverty Incidence

Crime

Like the other datasets, the crime data were retrieved from the Philippine Statistics Authority (PSA) yearbooks. All figures come from Chapter 17: Public Order, Safety, and Justice, but from different publication years: the 2018 data are from Table 3 of the 2019 yearbook, the 2020 and 2021 data are from Table 3 of the 2021 yearbook, and the 2022 and 2023 data are from the 2024 yearbook.

We tracked the total number of index and non-index crimes. Index crimes are serious crimes such as homicide, robbery, physical injury, and rape. Non-index crimes cover less severe cases, such as illegal logging, illegal possession of firearms and drugs, and violations of local ordinances. Adding the two gives the total number of crimes in a region. The dataset also includes the cleared cases for both types, meaning cases in which a charge has been filed against an individual. Lastly, it includes the total number of solved crimes, meaning cases in which the offender has been taken into custody, to better measure the efficiency of law enforcement in each region.

Unfortunately, the 2019 crime data could not be found in any publicly available database. The 2025 data were obtained from foi.gov.ph through a successful public request, delivered in the form of a scanned document, and were manually encoded into the unified data sheet below.

FOI Crime Request (2025)

Unified

This is the unified dataset for relevant data from the Employment, Poverty, and Crime datasets. Because regional poverty estimates are released only for 2018, 2021, and 2023, there are missing intermediate years. To address this, our analysis will consider three approaches:

1. Direct Analysis (Sparse Years) - Correlation and regression analysis using only the available years (2018, 2021, 2023).
2. Tolerance of Gaps - Treating missing years as gaps without interpolation (acknowledging limited temporal resolution).
3. Imputation & Time Series Modeling - Estimating values for gap years (2019, 2020, 2022) using statistical interpolation and time series forecasting techniques.
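The third approach can be sketched as follows. This is a minimal illustration of filling the gap years by linear interpolation within each region; the table layout, the helper name interpolate_gap_years, and all values are assumptions made for the example, and the notebook in the repository may use a different interpolation or forecasting method.

```python
import pandas as pd

# Hypothetical long-format table: one row per region and year (values are made up).
poverty = pd.DataFrame({
    "region": ["A"] * 3 + ["B"] * 3,
    "year": [2018, 2021, 2023] * 2,
    "poverty_urban": [100.0, 130.0, 110.0, 50.0, 80.0, 60.0],
})

def interpolate_gap_years(df, value_col, years=range(2018, 2024)):
    """Linearly interpolate value_col over missing years, region by region."""
    filled = []
    for region, group in df.groupby("region"):
        series = group.set_index("year")[value_col].reindex(list(years))
        series = series.rename_axis("year")
        # method="index" weights by the distance between years
        series = series.interpolate(method="index")
        filled.append(series.reset_index().assign(region=region))
    return pd.concat(filled, ignore_index=True)

imputed = interpolate_gap_years(poverty, "poverty_urban")
print(imputed)
```

Linear interpolation only draws straight lines between the observed years, so it cannot recover a shock such as a pandemic-year spike that falls inside a gap.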

Crime data from 2018 to 2025 are complete except for 2019. The Negros Island Region (NIR) was omitted because it was only reestablished in 2024.

Combination 2013-2022

Google Sheets

Methodology

Data Preprocessing

Since we are dealing with employment, poverty, and crime figures across the Philippine regions, we need to clean and preprocess each of them. Doing so protects the integrity of our data, and thus of our results and findings at the end of the study! Starting from the unified dataset, we examined it to confirm that the data remains consistent with its nature as described in the Data Collection section.

Data Preprocessing Workflow (from main.csv to reinforced_dataset.csv & standardized_dataset.csv)

1.) Data Conversion

On first inspection, the unified dataset has 119 rows and 24 columns, that is, a shape of (119, 24), so it is a small-scale dataset. We also noticed that some columns were not stored as numbers, which would make them incompatible with the statistical tests later in the study. We therefore converted these columns (i.e., population_estimate_15_over, labor_force_estimate, poverty_rural, index_crime, index_crime_clear, nonindex_crime, nonindex_crime_clear, crime_sovled_total) to numerical values. Either float or integer would work; we chose float.
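A minimal sketch of this conversion step, assuming the affected columns were read in as text. The sample rows, the thousands separators, and the helper name to_float are made up for illustration; the real main.csv may be formatted differently.

```python
import pandas as pd

# Hypothetical raw rows; real values in main.csv may be formatted differently.
raw = pd.DataFrame({
    "region": ["NCR", "CAR"],
    "index_crime": ["12,345", "678"],
    "crime_sovled_total": ["9,000", "n/a"],
})

def to_float(df, columns):
    """Return a copy with the given columns parsed as floats (unparseable -> NaN)."""
    out = df.copy()
    for col in columns:
        cleaned = out[col].astype(str).str.replace(",", "", regex=False).str.strip()
        out[col] = pd.to_numeric(cleaned, errors="coerce").astype(float)
    return out

converted = to_float(raw, ["index_crime", "crime_sovled_total"])
print(converted.dtypes)
```

Using errors="coerce" turns unparseable entries into NaN instead of raising, which makes them easy to count and inspect afterwards.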

2.) Data Dropping and Text Nuances

After the conversion, we dropped the columns that are irrelevant to the study’s aim, which cut the dataset’s shape to (119, 10). We also renamed columns for consistency (e.g., crime_sovled_total became total_crime_solved, since the misspelling was getting annoying).

3.) Data Scaling

In the unified dataset, some columns were not on their true scale, namely population_estimate_15_over, poverty_urban, and poverty_rural. Their values were recorded in thousands (i.e., divided by 1000), so we multiplied them by 1000 to restore their true values.

4.) Feature Engineering and Selection

Normally, this step comes after outlier detection. However, treating outliers might force us to repeat it, so we deliberately did it first so that data cleaning (if needed) would only have to be done once. We engineered and added features such as total_poverty to further test our hypotheses and help answer the research questions. We also included poverty and crime RATES to account for differences in regional populations. After all, larger populations may intuitively correspond to higher poverty, crime, and employment figures. Thus, we evaluate how the socioeconomic factors relate to crime on both scales: RAW COUNTS and RATES! This expanded our dataset’s shape from (119, 10) to (119, 18).
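The scaling and feature engineering steps can be sketched together. The rows below are made up, the column crime_rate_per_100k is a hypothetical name, and using the 15-and-over population as the denominator for the rates is an assumption of this example rather than something the write-up confirms.

```python
import pandas as pd

# Hypothetical rows; the first three columns are stored in thousands.
df = pd.DataFrame({
    "population_estimate_15_over": [9000.0, 1200.0],
    "poverty_urban": [300.0, 40.0],
    "poverty_rural": [0.0, 110.0],
    "total_crime": [140000.0, 6000.0],
})

# Step 3: restore true magnitudes.
in_thousands = ["population_estimate_15_over", "poverty_urban", "poverty_rural"]
df[in_thousands] = df[in_thousands] * 1000

# Step 4: engineered count feature.
df["total_poverty"] = df["poverty_urban"] + df["poverty_rural"]

# Step 4: rates (denominator is an assumption of this sketch).
population = df["population_estimate_15_over"]
df["total_poverty_rate"] = df["total_poverty"] / population * 100
df["crime_rate_per_100k"] = df["total_crime"] / population * 100_000

print(df[["total_poverty", "total_poverty_rate", "crime_rate_per_100k"]])
```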

5.) Outlier Detection

For outlier testing, we avoided simply picking a method and running the dataset through it. We first ran the Shapiro-Wilk test to check whether each feature was normally distributed. Features that passed were checked with a Z-score test; the rest were checked with an IQR test. The only "Likely Normal" feature in our dataset was total_poverty, so we ran the Z-score test on total_poverty and the IQR test on every feature. Both gave similar results, showing only one (1) critical outlier: a total_poverty_rate value of 109%, which is impossible for a rate! This pointed to an inconsistency in the original dataset, which we fixed through data imputation.
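A minimal sketch of the normality-then-outlier logic described above, using SciPy. The thresholds (alpha of 0.05, Z-score cutoff of 3, IQR multiplier of 1.5) and the sample values are assumptions chosen for illustration, not the settings confirmed by the notebook.

```python
import numpy as np
from scipy import stats

def flag_outliers(values, alpha=0.05, z_thresh=3.0, iqr_k=1.5):
    """Return (method, boolean mask), using Shapiro-Wilk to choose the rule."""
    x = np.asarray(values, dtype=float)
    _, p_value = stats.shapiro(x)
    if p_value > alpha:  # normality not rejected -> Z-score rule
        z = (x - x.mean()) / x.std(ddof=1)
        return "zscore", np.abs(z) > z_thresh
    # Otherwise fall back to the distribution-free IQR rule.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return "iqr", (x < q1 - iqr_k * iqr) | (x > q3 + iqr_k * iqr)

# Made-up poverty rates (in percent) with one impossible value.
rates = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7, 10.9, 109.0]
method, mask = flag_outliers(rates)
print(method, mask)
```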

6.) Outlier Treatment via Data Imputation

To handle the impossible value mentioned above, we used mean imputation, since our analysis covers both counts and rates. This resulted in an outlier-free dataset!
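Mean imputation of a flagged value can be sketched as below. The values are made up, and computing the mean from the remaining valid entries only (rather than from all entries) is an assumption of this example.

```python
import pandas as pd

# Made-up rates in percent; the last one is invalid.
rates = pd.Series([12.0, 15.5, 9.8, 109.0], name="total_poverty_rate")
is_outlier = rates > 100  # a rate above 100% cannot be valid

# Replace flagged entries with the mean of the valid entries.
clean_mean = rates[~is_outlier].mean()
imputed_rates = rates.mask(is_outlier, clean_mean)
print(imputed_rates)
```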

7.) Data Standardization

Since we use machine learning models later in the study (i.e., LinearRegression, PCA, Lasso, Ridge, etc.), standardizing the dataset was a must. We did this with RobustScaler() because our data was not normally distributed.
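A minimal sketch of this step with scikit-learn's RobustScaler, which centers each column on its median and scales it by its interquartile range. The two columns and their values are made up; the names reinforced_df and standardized_df follow the dataset names used in this write-up.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up numeric features; the last poverty value is deliberately extreme.
reinforced_df = pd.DataFrame({
    "poverty_urban": [100.0, 200.0, 300.0, 400.0, 5000.0],
    "employment_rate": [93.0, 94.0, 95.0, 96.0, 97.0],
})

scaler = RobustScaler()  # (x - median) / IQR, column by column
standardized_df = pd.DataFrame(
    scaler.fit_transform(reinforced_df),
    columns=reinforced_df.columns,
    index=reinforced_df.index,
)
print(standardized_df)
```

Because the median and IQR barely move when a single extreme value is present, the other rows keep a sensible scale even with the extreme poverty value in the last row.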

8.) Dataset Visualization & Storing

With everything cleaned and standardized, the datasets to be used for statistical testing and machine learning are now intact! Note that in this study, we will be using two (2) datasets:

reinforced_df – unstandardized dataset for proper raw count vs rate comparisons
standardized_df – dataset standardized via RobustScaler() for ML methods

Research Questions

With the data preprocessed, we can now analyze it to uncover patterns and insights regarding the relationship between poverty, employment, and crime rates across Philippine regions. For now, let's answer our research questions! For a more detailed explanation, see data_analysis.ipynb in our GitHub repository!

Before going through the answers, note that we tried four (4) different approaches to look for trends and correlations: Direct Analysis, Tolerance of Gaps, Imputation and Time Series Forecasting, and Imputation + Principal Component Analysis (PCA). We initially planned three (3), but added the fourth after consistent recommendations and subpar results from the first three. Each approach gave insights into the available data, and the last one, Imputation + PCA, proved the most insightful for answering the questions. Learn more about these approaches in our GitHub repository.

1.) What is the correlation between poverty and employment with crime counts/rates across regions in the Philippines?

We used scatter plots to visualize the relationships of poverty and employment with crime, comparing urban poverty, rural poverty, and employment rate against total crime.

In these plots, there is no obvious pattern or trend among the points. The Urban-Crime plot seems to show a slowly rising trend, with higher urban poverty accompanying more crimes. The points in the Rural-Crime plot look fairly random, and the same goes for the Employment-Crime plot. A deeper analysis is needed to find the underlying trends that are not so clear to the naked eye, so we used a correlation heatmap to visualize and quantify the correlations.

In this correlation heatmap, urban poverty (poverty_urban) has the strongest correlation with the total crimes (total_crime) committed in a region. Employment rate (employment_rate) is the second-strongest indicator, although note that its correlation is negative. Rural poverty (poverty_rural) comes last, with a correlation very close to zero (0) (i.e., effectively absent). There could be many reasons for this, such as a more "relaxed" life in rural areas or provinces and other cultural reasons. For now, we will keep it as is, since this relationship and the reasoning behind it are discussed further in the later findings. Nevertheless, this correlation heatmap directly answers research question number 1 (RQ1). Now, let us take a more "regional" point of view.
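The numbers behind such a heatmap come from a correlation matrix. The sketch below assumes Spearman rank correlation and uses made-up values that only mimic the pattern described; the notebook may use a different correlation method.

```python
import pandas as pd

# Made-up regional values that mimic the pattern described in the text.
df = pd.DataFrame({
    "poverty_urban": [1, 2, 3, 4, 5, 6],
    "poverty_rural": [3, 6, 1, 4, 5, 2],
    "employment_rate": [96, 95, 97, 93, 92, 91],
    "total_crime": [10, 20, 30, 40, 50, 60],
})

corr_matrix = df.corr(method="spearman")

# Rank the features by the strength of their correlation with total_crime.
corr_with_crime = (
    corr_matrix["total_crime"]
    .drop("total_crime")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_crime)
```

Sorting by absolute value keeps a strong negative correlation, like the one for employment rate, ahead of a weak positive one.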

2.) How do regional variations in these socioeconomic and institutional factors (i.e., poverty and employment) influence differences in crime rates?

Here, we analyze regional variations; in other words, we compare the values across regions. We mainly use the Imputation and Time Series Forecasting approach for this research question. First, we show the time series plots and compare the regions.

From these graphs, we can see that the regions generally follow the same trends. Urban poverty rose from 2018 to 2021, possibly due to the pandemic, and some regions started to recover after the 2021 peak, declining from 2022 to 2025. The employment rate also dropped sharply in 2020, possibly due to the pandemic, but it recovered in 2021 and continued to rise or stay flat until 2025. Total crime declined in every region from 2018 to 2025. The only notable outlier is NCR, with approximately 140,000 total crimes in 2018, which the IQR test in the EDA confirmed.

Comparing the regions' urban poverty with their total number of crimes, we can see a slight pattern. The top five (5) regions in urban poverty are Region 4A, Region 3, Region 7, Region 10, and Region 12. All of them except Regions 10 and 12 are also in the top five (5) for total crimes, though in a slightly different order: NCR has the most, followed by Region 4A, Region 7, Region 3, and lastly Region 11. So the slight correlation between urban poverty and total crime noted earlier holds for the top five (5) regions.

Moving on to employment rates, the regions do not differ much from each other. The bottom five (5) regions in employment rate are NCR, Region 4A, Region 1, Region 3, and Region 5. Only three (3) of them are in the top five (5) for total crimes, and only two (2) are also in the top five (5) for urban poverty.

Looking at the opposite end, the bottom five (5) regions in urban poverty are CAR and Regions 6, 2, 1, and 8. The five (5) highest employment rates belong to Regions 2, 9, 11, 13, and 12, and the five (5) lowest crime totals belong to BARMM, CAR, and Regions 4B, 13, and 9. CAR (low urban poverty) and Regions 9 and 13 (high employment rate) therefore also appear among the lowest crime totals. Region 4B has the 6th-lowest number of people in urban poverty, Region 13 ranked 6th, and Region 9 ranked 7th lowest, which shows a bit of a relationship with their low crime totals.

However, BARMM, which has the fewest crimes, ranks 7th in urban poverty, 4th in rural poverty, and 6th lowest in employment rate. Despite these counterintuitive placements, it records the lowest number of crimes. This could be due to many factors, such as religion, the 2014 peace deal that ended the war in the region, and continuing efforts to end conflicts.

Awesome! Now, let's look at the averages over the years in a bar graph to compare them more easily.

Here, the average crime rate appears higher in three (3) regions: NCR, Region 7 (Central Visayas), and Region 11 (Davao Region). There seems to be some relationship between a region's location and crime, which may be due to the employment and poverty rates associated with those locations. One thing to notice about this "Big 3" is that they also rank relatively high in urban poverty, which supports the correlation found in RQ1. Stay tuned though, as the Machine Learning Model section will show that NCR becomes an important feature for tree-based models!

3.) To what extent can improvements in these factors (e.g., reduced poverty, increased employment) predict a decrease in crime rates across regions?

This will be answered using our developed models. Feeling excited? Okay then... proceed to the Machine Learning Model section!

Hypothesis Testing

This section focuses on evaluating our proposed null hypotheses through statistical hypothesis testing. We employed multiple analytical approaches to robustly assess the relationships between poverty, employment, and crime rates across Philippine regions.

a.) An Overview

If both the independent feature (i.e., poverty and/or employment feature) and the target are approximately normally distributed, we carry out a Pearson correlation test. Otherwise, a Spearman correlation test is conducted. This approach allows us to quickly identify linear or monotonic relationships between each feature and the target variable. Similarly, group comparison tests are performed to determine whether the target variable differs significantly between subgroups. For this, we first split each feature into two groups based on quantiles (Low vs. High). Then, we apply Levene’s test to check for equality of variances. If variances are equal, we use an independent t-test (parametric); if not, we use a Mann–Whitney U test (non-parametric). This approach is used because the t-test assumes equal variances and normally distributed data, while the Mann–Whitney U test is robust to unequal variances and non-normal distributions. By dynamically selecting the appropriate test, we ensure that our conclusions about differences or associations are statistically valid regardless of the underlying distribution of the data.
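The test-selection logic above can be sketched with SciPy as follows. The helper names, the 0.05 threshold for every check, and the median split into Low and High groups are assumptions of this example; the write-up only says the split is quantile-based.

```python
import numpy as np
from scipy import stats

def is_normal(values, alpha=0.05):
    """Shapiro-Wilk check: True when normality is not rejected at level alpha."""
    _, p_value = stats.shapiro(values)
    return p_value > alpha

def correlation_test(feature, target, alpha=0.05):
    """Pearson if both variables look normal, otherwise Spearman."""
    if is_normal(feature, alpha) and is_normal(target, alpha):
        r, p_value = stats.pearsonr(feature, target)
        return "pearson", r, p_value
    rho, p_value = stats.spearmanr(feature, target)
    return "spearman", rho, p_value

def group_comparison(feature, target, alpha=0.05):
    """Split target into Low/High groups by the feature's median, then compare."""
    feature = np.asarray(feature, dtype=float)
    target = np.asarray(target, dtype=float)
    high = feature > np.median(feature)
    low_group, high_group = target[~high], target[high]
    _, p_levene = stats.levene(low_group, high_group)
    if p_levene > alpha:  # equal variances not rejected -> parametric test
        _, p_value = stats.ttest_ind(low_group, high_group)
        return "t-test", p_value
    _, p_value = stats.mannwhitneyu(low_group, high_group, alternative="two-sided")
    return "mann-whitney", p_value

# Made-up data with a strong positive relationship.
x = np.arange(20, dtype=float)
y = 2.0 * x + np.sin(x)
corr_name, corr_value, corr_p = correlation_test(x, y)
group_name, group_p = group_comparison(x, y)
print(corr_name, corr_value, corr_p)
print(group_name, group_p)
```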

b.) The Turnout

To determine whether poverty and/or employment levels have a significant effect on crime across Philippine regions, we took a branched, granular approach. By granular, we mean that we did not focus only on the raw numbers of poverty and employment to see how they affect crime in the country as a whole. We also considered rates of poverty and employment across regions because, intuitively (without data in mind yet…), larger regions would have larger poverty, employment, and crime numbers simply because of their size. But does scaling regions fairly reveal a significant effect on crime rates, or were raw counts enough to predict crime counts nationally?

i.) Counts

Across all four (4) approaches (recall that we added Imputation + PCA to the initial three following statistical recommendations), hypothesis testing showed poverty_urban and employment_rate to be significant factors for crime counts. At a significance level of 0.05, we rejected the null hypotheses that poverty (urban only, not rural) and employment have no significant effect on crime counts across Philippine regions. This result is consistent across all four (4) approaches, so we interpret it as a robust conclusion!

Hypothesis Testing Numeric Table for Crime Counts

Hypothesis Testing Summary for Crime Counts

ii.) Rates

In contrast to counts, the four (4) approaches indicated that rates are not accurately predicted by the socioeconomic factors, perhaps because of a lack of data or some other unmeasured factor. So even before hypothesis testing, we expected some inconsistencies, and unsurprisingly the tests revealed exactly that. For rates, only employment_rate led to a clear rejection of the null hypothesis that employment has no significant effect on crime rates. Unlike with counts, however, this conclusion was inconsistent across the four (4) approaches, because the direct analysis approach failed to reject the null hypothesis. Hence, poverty and employment may well have a significant effect on crime rates, but with the means available to us, the results for rates are inconclusive.

Hypothesis Testing Numeric Table for Crime Rates

Hypothesis Testing Summary for Crime Rates

c.) The Verdict

After hypothesis testing, we can now confidently conclude the following:

H₀,₁ (Urban Poverty): Poverty has no significant effect on crime → Rejected
H₀,₁ (Rural Poverty): Poverty has no significant effect on crime → Failed to reject
H₀,₂: Employment rate has no significant effect on crime → Rejected

Machine Learning Model

So far, we haven't answered all of our research questions! To answer the final research question (RQ3), we need predictive modeling! As usual, please check our GitHub repository for a more detailed explanation.

3.) To what extent can improvements in these factors (e.g., reduced poverty, increased employment) predict a decrease in crime rates across regions?

Looping back to the final research question, the machine learning models help determine whether lowering poverty and increasing employment would affect the number of crimes. To evaluate whether socioeconomic indicators can predict crime across Philippine regions, we tested multiple machine learning models using cross-validated R² scores. Two (2) separate targets were evaluated: Crime Counts and Crime Rates (per capita).

Linear Regression, Ridge Regression, Lasso Regression, Random Forest, and Gradient Boosting were trained and evaluated using R². Linear Regression was chosen to look for linear relationships between the variables. Ridge and Lasso Regression were chosen to lessen the impact of multicollinearity, since poverty and employment may be correlated, as seen in the heatmap. Random Forest and Gradient Boosting were chosen to capture nonlinear relationships and regional differences.
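A minimal sketch of comparing these five model families by cross-validated R² with scikit-learn. The data here is synthetic, and the hyperparameters and the 5-fold split are assumptions for illustration; they are not the tuned settings or the scores reported in this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the standardized features and the crime target.
rng = np.random.default_rng(0)
X = rng.normal(size=(119, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=119)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

# Shuffled 5-fold cross-validation; the mean R² summarizes each model.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```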

Image of the code to finetune the models

To get the best results we could, we did some simple hyperparameter tuning, as seen in the picture above. The results are shown below. Again, for the full code and implementation of the models, please visit our GitHub repository.

Image of the results of the machine learning models

The same models were also used to predict crime rates; however, none of them proved effective or reliable.

Based on the R² validation, Ridge and Lasso performed the best among the models, with scores of 0.315 and 0.254, respectively. Ridge likely helps by reducing coefficient variance, and Lasso by setting irrelevant coefficients to zero (feature selection). This suits our study since socioeconomic indicators such as poverty and employment share overlapping variance, and regularization prevents overfitting while keeping the meaningful predictors. However, note that we do not report feature importance from Ridge and Lasso: the key predictors came from hypothesis testing, not from the ML coefficients. In other words, the main drivers were identified by hypothesis testing, and these models are used only for prediction. The two work hand in hand to give meaning to the data.

These graphs show the performance and accuracy of the models. The more dots that lie close to the dotted line, the more accurate the predictions.

Going back to the research question, feature importance was retrieved from the tree-based models.

From here, we can see that population_estimate_15_over was the most important feature for the tree models, with the NCR region coming in second. Our main socioeconomic indicators, urban poverty and employment rate, contributed far less by comparison. We then simulated the effect of improving the socioeconomic factors by decreasing urban poverty by 10% and increasing the employment rate by 5%. The predicted change in crime counts was modest, and the model explains only about 32% of the variance (R² ≈ 0.32). This suggests that improvements in these factors are associated with lower predicted crime, while the remaining 68% of the variance is left to other factors, such as social or cultural ones. To summarize, the simulation indicates that both factors are related to the total number of crimes, but their effect is limited, as other unmeasured factors also contribute significantly.
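A what-if simulation of this kind can be sketched as below. Everything here is an assumption made for illustration: the data is synthetic, the model is a plain Ridge fit on unscaled features, and the 5% employment increase is treated as a relative change capped at 100%. It shows the mechanics only and does not reproduce the study's numbers.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic regions: crime rises with urban poverty and falls with employment.
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "poverty_urban": rng.uniform(50_000, 500_000, n),
    "employment_rate": rng.uniform(90, 97, n),
})
y = (
    0.05 * X["poverty_urban"]
    - 800 * X["employment_rate"]
    + 90_000
    + rng.normal(0, 2_000, n)
)

model = Ridge(alpha=1.0).fit(X, y)

# Scenario: 10% less urban poverty, 5% higher employment rate (relative).
scenario = X.copy()
scenario["poverty_urban"] = scenario["poverty_urban"] * 0.90
scenario["employment_rate"] = (scenario["employment_rate"] * 1.05).clip(upper=100)

baseline_mean = model.predict(X).mean()
scenario_mean = model.predict(scenario).mean()
change = scenario_mean - baseline_mean
print(f"Average predicted change in crime count: {change:.0f}")
```

A simulation like this only reports what the fitted model would predict under different inputs; it does not show that changing the inputs would cause the predicted change.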

Team

Alessandro Crisostomo

Gerard Salao

Bryan Uy