Calculate Weighted Median Using Optimized Algorithm
Source: R/analyze_employment_transitions.R
dot-calculate_weighted_median_optimized.Rd
Efficiently calculates weighted median using matrixStats::weightedMedian()
instead of the memory-intensive rep() approach. This optimization dramatically
reduces memory usage and computation time for large datasets with high-weight
observations.
Details
Performance Characteristics:
Benchmark results show substantial improvements over the rep()-based approach:
Speed: 31-59x faster (0.06 ms vs 1.8-3.6 ms for 1,000 observations)
Memory: 96-99% reduction
Scalability: O(n log n) complexity independent of weight magnitude
The old approach using rep(values, times = weights) creates massive
temporary vectors. For example, with 1,000 observations and average weight of
365 days (typical employment duration):
Replicated vector: 365,000 elements
Memory allocation: 2.92 MB per calculation
Two calculations per transition (from + to): 5.84 MB
100 unique transitions: 584 MB temporary memory
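The figures above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes the replicated values are stored as 8-byte doubles and ignores R's small per-vector header; the variable names are illustrative only.

```r
# Back-of-the-envelope memory cost of rep(values, times = weights).
n_obs            <- 1000  # observations per calculation
avg_weight       <- 365   # average weight (days employed)
bytes_per_double <- 8     # assumption: values are stored as doubles

replicated_len    <- n_obs * avg_weight                       # elements in the temporary vector
mb_per_calc       <- replicated_len * bytes_per_double / 1e6  # MB for one replicated vector
mb_per_transition <- 2 * mb_per_calc                          # "from" + "to" calculations
mb_100            <- 100 * mb_per_transition                  # 100 unique transitions

cat(replicated_len, mb_per_calc, mb_per_transition, mb_100, "\n")
```

The total is cumulative allocation: each temporary vector can be garbage-collected after use, but it still has to be allocated and filled once per calculation.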
The optimized matrixStats::weightedMedian() approach:
Works directly with original vectors (no replication)
Memory: O(n) instead of O(n × w) where w = average weight
Numerically stable for arbitrarily large weights
Computes the weighted median using an efficient sorting algorithm
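To illustrate why no replication is needed, here is a minimal base-R sketch of a sort-based weighted median. It is not the matrixStats implementation: weighted_median_sketch() is a hypothetical helper that returns the lower weighted median (the smallest value whose cumulative weight reaches half the total), whereas matrixStats::weightedMedian() interpolates by default, so the two can differ when the halfway point falls between values.

```r
# Hypothetical helper for illustration; not part of the package or matrixStats.
weighted_median_sketch <- function(x, w) {
  stopifnot(length(x) == length(w))

  # Drop missing values and observations that carry no weight.
  keep <- !is.na(x) & !is.na(w) & w > 0
  x <- x[keep]
  w <- w[keep]
  if (length(x) == 0L) return(NA_real_)

  # Sort once: O(n log n), regardless of how large the weights are.
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]

  # First value whose cumulative weight reaches half of the total weight.
  cum_w <- cumsum(w)
  x[which(cum_w >= sum(w) / 2)[1L]]
}

salaries  <- c(30000, 35000, 40000, 45000, 50000)
durations <- c(365, 730, 180, 90, 1095)  # days employed

result <- weighted_median_sketch(salaries, durations)
print(result)
```

For this input the cumulative weights are 365, 1095, 1275, 1365, 2460, and half the total is 1230, so the result is 40000, which here matches median(rep(salaries, times = durations)).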
Edge Case Handling:
Empty vectors: Returns NA_real_
All NA values: Returns NA_real_
Zero weights: Returns NA_real_
Mismatched lengths: Stops with error
Negative weights: Treated as zero weights by matrixStats::weightedMedian()
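The edge cases listed above could be guarded along the following lines. This is a sketch only, not the package's actual code: guarded_weighted_median() is a hypothetical name, the na.rm = TRUE default and the error message are assumptions, and the final call assumes matrixStats is installed.

```r
# Hypothetical guard wrapper; the real function body is not shown in this page.
guarded_weighted_median <- function(values, weights, na.rm = TRUE) {
  # Mismatched lengths: stop with an error.
  if (length(values) != length(weights)) {
    stop("'values' and 'weights' must have the same length")
  }

  # Empty vectors: nothing to summarise.
  if (length(values) == 0L) return(NA_real_)

  # Optionally drop pairs where either the value or the weight is missing.
  if (na.rm) {
    keep <- !is.na(values) & !is.na(weights)
    values  <- values[keep]
    weights <- weights[keep]
  }

  # All values NA (nothing left), or NAs kept because na.rm = FALSE.
  if (length(values) == 0L || anyNA(values) || anyNA(weights)) return(NA_real_)

  # Zero weights: the median is undefined.
  if (all(weights == 0)) return(NA_real_)

  # Normal path: delegate to matrixStats (only reached for valid input).
  matrixStats::weightedMedian(values, w = weights)
}

# Edge cases only; none of these reach the matrixStats call.
print(guarded_weighted_median(numeric(0), numeric(0)))       # empty input
print(guarded_weighted_median(c(NA_real_, NA_real_), 1:2))   # all values NA
print(guarded_weighted_median(c(30000, 40000), c(0, 0)))     # zero weights
```

Checking the cheap conditions first keeps the failure modes explicit and means the library call only ever sees non-empty, non-missing input with some positive or negative weight.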
Optimization
This function replaces memory-intensive vector replication with direct weighted median calculation. The optimization is most beneficial when:
Weights are large (e.g., employment durations in days)
Many weighted medians need to be calculated (e.g., for each transition)
Memory is constrained
See also
matrixStats::weightedMedian() for the underlying implementation
Examples
if (FALSE) { # \dontrun{
# Calculate weighted median of salaries by employment duration
salaries <- c(30000, 35000, 40000, 45000, 50000)
durations <- c(365, 730, 180, 90, 1095) # days employed
# Old approach (memory-intensive):
# median(rep(salaries, times = durations))
# Optimized approach:
.calculate_weighted_median_optimized(salaries, durations)
# With NA handling:
salaries_na <- c(30000, NA, 40000, 45000, 50000)
.calculate_weighted_median_optimized(salaries_na, durations, na.rm = TRUE)
} # }