Propensity Score Matching for Impact Evaluation (Person-Level Matching)
Source: R/impact_matching.R
Performs propensity score matching at the person level to identify comparable control units for treatment effect estimation. The function first aggregates event-level data to person-level characteristics, then matches persons, and finally returns all events for the matched individuals. This matters for employment data, where the units to match are people (not individual employment spells), compared on their stable characteristics.
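The three steps described above (aggregate, match, return all events) can be sketched in base R as follows. This is only an illustration of the data flow, not the function's actual implementation: the toy columns `start` and `age` are hypothetical, and the matching step is replaced by a hard-coded placeholder.

```r
# Toy event-level data: one row per employment spell (hypothetical columns).
events <- data.frame(
  cf         = c("A", "A", "B", "B", "B", "C", "D"),
  start      = as.Date(c("2020-01-01", "2021-03-01", "2019-06-01",
                         "2020-02-01", "2021-01-01", "2020-05-01",
                         "2020-07-01")),
  age        = c(30, 31, 45, 46, 47, 29, 52),
  is_treated = c(1, 1, 0, 0, 0, 0, 1)
)

# Step 1: aggregate to one row per person, keeping the first observed spell
# (the analogue of person_aggregation = "first").
events  <- events[order(events$cf, events$start), ]
persons <- events[!duplicated(events$cf), c("cf", "age", "is_treated")]

# Step 2: matching would run on `persons`. The result is hard-coded here as a
# placeholder: suppose persons A and C were matched.
matched_ids <- c("A", "C")

# Step 3: return every event that belongs to a matched person.
matched_events <- events[events$cf %in% matched_ids, ]
```

Matching on `persons` rather than on `events` keeps a person with many spells from entering the propensity model more than once.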
Usage
propensity_score_matching(
  data,
  treatment_var = "is_treated",
  person_id_var = "cf",
  matching_variables,
  exact_match_vars = NULL,
  person_aggregation = "first",
  method = "nearest",
  ratio = 1,
  caliper = NULL,
  replace = FALSE,
  estimand = "ATT",
  link = "logit",
  distance_metric = "glm",
  missing_data_action = "complete_cases",
  imputation_method = "median",
  min_complete_cases = 0.5,
  max_missing_proportion = 0.3,
  factor_level_threshold = 5,
  verbose = TRUE
)
Arguments
- data
A data.table containing treatment and control observations (can be event-level)
- treatment_var
Character. Name of treatment indicator variable. Default: "is_treated"
- person_id_var
Character. Name of person identifier variable (e.g., "cf"). Default: "cf"
- matching_variables
Character vector. Variables to include in propensity score model
- exact_match_vars
Character vector. Variables requiring exact matches. Default: NULL
- person_aggregation
Character. How to aggregate person characteristics: "first", "last", "mode", "mean". Default: "first"
- method
Character. Matching method: "nearest", "optimal", "full", "genetic". Default: "nearest"
- ratio
Numeric. Ratio of control to treatment units. Default: 1
- caliper
Numeric. Maximum allowable distance for matches. Default: NULL (no caliper)
- replace
Logical. Allow replacement in matching? Default: FALSE
- estimand
Character. Target estimand: "ATT", "ATE", "ATC". Default: "ATT"
- link
Character. Link function for propensity model: "logit", "probit". Default: "logit"
- distance_metric
Character. Model used to estimate the propensity score that serves as the matching distance: "glm", "gam", "gbm", "randomforest". Default: "glm"
- missing_data_action
Character. How to handle missing data: "complete_cases" (default), "impute", "exclude_vars"
- imputation_method
Character. For numeric variables: "median" (default), "mean". For categorical: "mode"
- min_complete_cases
Numeric. Minimum proportion of complete cases required to proceed (0-1). Default: 0.5
- max_missing_proportion
Numeric. Maximum proportion of missing values allowed per variable (0-1). Default: 0.3
- factor_level_threshold
Numeric. Minimum observations per factor level. Default: 5
- verbose
Logical. Print detailed missing data diagnostics? Default: TRUE
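To make the interplay of `link`, `method = "nearest"`, `ratio = 1`, `replace = FALSE` and `caliper` concrete, here is a minimal base R sketch of greedy 1:1 nearest-neighbour matching on a logistic propensity score. It is not the package's implementation. The simulated data are hypothetical, and the caliper is assumed to be in raw propensity-score units, which this page does not specify.

```r
set.seed(42)

# Simulated person-level data (hypothetical covariates).
n <- 200
persons <- data.frame(
  cf        = sprintf("P%03d", seq_len(n)),
  age       = round(runif(n, 20, 60)),
  education = sample(1:3, n, replace = TRUE)
)
# Treatment is more likely for older persons.
persons$is_treated <- rbinom(n, 1, plogis(-3 + 0.06 * persons$age))

# Propensity score from a logistic regression (the analogue of link = "logit").
ps_model <- glm(is_treated ~ age + education, data = persons,
                family = binomial(link = "logit"))
persons$ps <- fitted(ps_model)

# Greedy 1:1 nearest-neighbour matching without replacement.
# ASSUMPTION: the caliper is expressed in raw propensity-score units.
caliper     <- 0.05
treated     <- which(persons$is_treated == 1)
controls    <- which(persons$is_treated == 0)
match_pairs <- data.frame(treated = integer(0), control = integer(0))

for (i in treated) {
  if (length(controls) == 0) break
  d <- abs(persons$ps[controls] - persons$ps[i])
  j <- which.min(d)
  if (d[j] <= caliper) {
    match_pairs <- rbind(match_pairs,
                         data.frame(treated = i, control = controls[j]))
    controls <- controls[-j]  # each control is used at most once
  }
}
```

Treated persons with no control inside the caliper stay unmatched, so a tighter caliper trades sample size for closer matches.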
Value
A list containing:
- matched_data
Data.table with ALL events for matched individuals (both treated and control). Includes an event_time column with one of the values "pre" (pre-treatment), "post" (post-treatment), "control" (control group), or NA
- matched_persons
Data.table with person-level characteristics used for matching
- match_matrix
Matrix showing which persons were matched
- propensity_scores
Propensity scores for all persons
- balance_before
Balance statistics before matching (person-level)
- balance_after
Balance statistics after matching (person-level)
- match_summary
Summary of matching procedure
- common_support
Information about common support region
- data_quality_report
Report on missing data handling and data quality issues
- aggregation_report
Report on person-level aggregation process
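This page does not say which statistics `balance_before` and `balance_after` contain. A measure commonly used for this purpose is the standardized mean difference (SMD). The sketch below computes it for a single covariate on hypothetical data, using the treated-group standard deviation as the denominator, which is one convention among several.

```r
# Standardized mean difference for one covariate, scaled by the
# treated-group standard deviation (an assumed convention).
smd <- function(x, treated) {
  (mean(x[treated == 1]) - mean(x[treated == 0])) / sd(x[treated == 1])
}

# Hypothetical person-level data: 3 treated persons, 6 potential controls.
age     <- c(50, 55, 60, 30, 35, 40, 52, 58, 56)
treated <- c(1, 1, 1, 0, 0, 0, 0, 0, 0)

# Before matching: the treated group is compared with all controls.
smd_before <- smd(age, treated)

# After matching: suppose only the three oldest controls were retained.
keep      <- c(1, 2, 3, 7, 8, 9)
smd_after <- smd(age[keep], treated[keep])
```

An absolute SMD that shrinks towards zero after matching suggests the retained controls resemble the treated group on that covariate.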
Examples
if (FALSE) { # \dontrun{
# Process employment data first
employment_data <- vecshift(raw_employment_data)

# Add treatment indicator (e.g., policy intervention)
employment_data[, is_treated := some_treatment_condition]

# Basic person-level propensity score matching
ps_match <- propensity_score_matching(
  data = employment_data,
  person_id_var = "cf",
  matching_variables = c("age", "education", "sector", "region"),
  exact_match_vars = c("gender"),
  person_aggregation = "first",
  method = "nearest",
  ratio = 2,
  missing_data_action = "complete_cases",
  min_complete_cases = 0.7
)

# Result contains all events for matched persons
matched_employment_data <- ps_match$matched_data
person_characteristics <- ps_match$matched_persons

# Advanced matching with imputation for missing values
ps_match_imputed <- propensity_score_matching(
  data = employment_data,
  person_id_var = "cf",
  matching_variables = c("age", "education", "prior_employment", "wage"),
  person_aggregation = "mode", # Use mode for categorical variables
  method = "optimal",
  caliper = 0.1,
  distance_metric = "gbm",
  missing_data_action = "impute",
  imputation_method = "median"
)
} # }
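Once matching has run, the `event_time` column of `matched_data` can be used to summarise events by group. The sketch below works on a small mock data frame standing in for `ps_match$matched_data`. The `wage` outcome column and all values are hypothetical and are not something the function is documented to return.

```r
# Mock stand-in for ps_match$matched_data (hypothetical values; `wage` is an
# assumed outcome column used only for illustration).
matched_data <- data.frame(
  cf         = c("A", "A", "A", "C", "C"),
  is_treated = c(1, 1, 1, 0, 0),
  event_time = c("pre", "post", "post", "control", "control"),
  wage       = c(1200, 1500, 1600, 1300, 1350)
)

# Number of events per event_time label; useNA keeps any NA category visible.
event_counts <- table(matched_data$event_time, useNA = "ifany")

# Mean outcome by event_time label.
mean_wage <- tapply(matched_data$wage, matched_data$event_time, mean)
```

Such a tabulation is only a descriptive check. It is not a treatment effect estimate.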