PN: Algorithmic Bias in AI-Assisted Project Management

Core Idea

LLMs trained on historical software project data inherit the biases present in that data. When deployed in PM tools, these biases can systematically disadvantage certain developers, issue types, or project contexts — creating unfair workload distributions, skewed priority assignments, and inequitable effort estimates.

Sources of Bias in PUMA’s Pipeline

1. Training Data Bias

LLMs are pretrained on internet-scale text corpora (GitHub, Stack Overflow, Reddit). This corpus:

Overrepresents English-language projects, Western tech companies, open-source (vs. enterprise)
Underrepresents non-English issue descriptions, Global South project contexts, legacy codebases
Contains historical inequity: Past project data reflects past team compositions and past priorities

PUMA manifestation: A model trained on GitHub data may perform better on English bug reports from well-documented open-source projects than on Spanish Jira issues from a legacy enterprise application.

2. Label Bias (TAWOS Dataset)

The TAWOS dataset contains human-assigned labels (issue type, priority, story points). These labels reflect the biases of the original human classifiers:

A senior developer may consistently underestimate junior-contributed issues
Priority labels may reflect political influence (e.g., VIP stakeholder requests labeled Critical regardless of actual severity)
Story point calibration varies across teams and time periods

PUMA manifestation: A model trained to predict these labels learns the human biases embedded in them.

3. Prompt Bias

Few-shot examples selected for estimation (H2) may inadvertently encode biases:

If retrieval selects examples from a specific team or period, the model anchors on those patterns
If team A consistently labeled similar work as 3 SP and team B as 8 SP, retrieving examples only from team A pulls estimates for team B's issues systematically downward
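One possible way to limit this anchoring is to cap how many few-shot examples may come from a single team. The sketch below is illustrative only: the function name `select_balanced_examples`, the dictionary keys (`team`, `similarity`, `story_points`), and the toy candidates are assumptions, not part of PUMA's retrieval code.

```python
from collections import Counter


def select_balanced_examples(candidates, k, max_per_team):
    """Pick up to k examples by similarity, capping how many come from one team.

    candidates: list of dicts with hypothetical keys
    "team", "similarity", and "story_points".
    """
    selected = []
    per_team = Counter()
    # Walk candidates from most to least similar.
    for cand in sorted(candidates, key=lambda c: c["similarity"], reverse=True):
        if per_team[cand["team"]] >= max_per_team:
            continue  # this team has already filled its quota
        selected.append(cand)
        per_team[cand["team"]] += 1
        if len(selected) == k:
            break
    return selected


# Toy data: team A dominates the top of the similarity ranking.
candidates = [
    {"team": "A", "similarity": 0.95, "story_points": 3},
    {"team": "A", "similarity": 0.93, "story_points": 3},
    {"team": "A", "similarity": 0.90, "story_points": 3},
    {"team": "B", "similarity": 0.88, "story_points": 8},
    {"team": "B", "similarity": 0.80, "story_points": 8},
]
examples = select_balanced_examples(candidates, k=4, max_per_team=2)
teams = [e["team"] for e in examples]
print(teams)
```

A plain top-4 selection would take three examples from team A. With the cap, the prompt shows both calibration scales, at the cost of slightly less similar examples.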

4. Model Bias (Pretrained Associations)

LLMs associate certain words with certain categories based on pretraining statistics:

“Frontend” issues may be associated with lower story points in training data
“Security” issues may be associated with high priority regardless of actual severity
Developer names in issue descriptions could influence priority predictions if names correlate with seniority in historical data
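A simple way to probe for name-driven associations is a counterfactual test: mask developer names and check whether the prediction changes. The following is a minimal sketch under stated assumptions. `mask_names`, `name_sensitivity`, and the deliberately biased `toy_predict` stand-in are hypothetical; in practice `predict_priority` would wrap the actual model call.

```python
import re


def mask_names(text, names, placeholder="[DEV]"):
    """Replace each known developer name with a neutral placeholder."""
    if not names:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


def name_sensitivity(issues, names, predict_priority):
    """Fraction of issues whose predicted priority changes once names are masked."""
    if not issues:
        return 0.0
    changed = sum(
        predict_priority(text) != predict_priority(mask_names(text, names))
        for text in issues
    )
    return changed / len(issues)


def toy_predict(text):
    # Hypothetical stand-in for the model, biased toward one name on purpose.
    return "Critical" if "alice" in text.lower() else "Medium"


issues = [
    "Alice reported login timeout",
    "Checkout button misaligned",
    "Bob: crash on export",
]
rate = name_sensitivity(issues, ["Alice", "Bob"], toy_predict)
print(f"Predictions changed by masking names: {rate:.0%}")
```

A rate well above zero suggests the model is reacting to who is named rather than to what the issue describes.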

Bias Taxonomy for PUMA

Bias Type	Definition	PUMA Example
Representation bias	Training data does not represent all subgroups equally	English-only triage model underperforms on multilingual issues
Measurement bias	Labels are systematically wrong for certain subgroups	Priority labels biased by stakeholder influence
Aggregation bias	Single model applied to heterogeneous subgroups	One SP estimation model for all project types
Evaluation bias	Benchmark does not reflect real deployment population	TAWOS subset may not generalize to all companies
Deployment bias	Model performs well in testing but fails on the production population	Model calibrated on TAWOS, deployed on a project with different vocabulary

Bias Detection in PUMA

Subgroup Analysis

Instead of reporting only aggregate Macro-F1, evaluate performance across subgroups:

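from sklearn.metrics import f1_score

# df: one row per issue, with columns actual_type, predicted_type,
# project, actual_sp, predicted_sp (assumed to be loaded already).

# Per-class F1 over the full frame. Filtering to rows of one actual type
# first would drop that class's false positives and inflate precision.
issue_types = ["Bug", "Feature", "Task", "Improvement"]
per_class_f1 = f1_score(
    df["actual_type"], df["predicted_type"],
    labels=issue_types, average=None, zero_division=0,
)
for issue_type, f1 in zip(issue_types, per_class_f1):
    n = (df["actual_type"] == issue_type).sum()
    print(f"{issue_type}: F1 = {f1:.3f}, n = {n}")

# Slice story point error by project (or component)
for project, subset in df.groupby("project"):
    mae = (subset["predicted_sp"] - subset["actual_sp"]).abs().mean()
    print(f"{project}: MAE = {mae:.2f} SP")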

Fairness Metrics

Metric	Formula	Acceptable Range
Demographic parity	$\lvert P(\hat{Y}=1 \mid A=0) - P(\hat{Y}=1 \mid A=1) \rvert$	
Equalized odds	Equal TPR and FPR across groups	Difference < 0.05
F1 gap across issue types	$\max_k(F1_k) - \min_k(F1_k)$	< 0.15
MAE gap across projects	$\max_j(MAE_j) - \min_j(MAE_j)$	< 2 SP
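The metrics in the table can be computed directly from predictions. The sketch below uses plain Python and assumes binary predictions and a binary group attribute `A` coded as 0 and 1. The function names are illustrative, and taking the larger of the TPR gap and the FPR gap as the equalized odds difference is one common convention, assumed here rather than taken from the note.

```python
def positive_rate(preds):
    """Share of predictions equal to 1. Assumes preds is non-empty."""
    return sum(preds) / len(preds)


def demographic_parity_diff(y_pred, group):
    """|P(Y_hat=1 | A=0) - P(Y_hat=1 | A=1)|."""
    by_group = {g: [p for p, a in zip(y_pred, group) if a == g] for g in (0, 1)}
    return abs(positive_rate(by_group[0]) - positive_rate(by_group[1]))


def equalized_odds_diff(y_true, y_pred, group):
    """Larger of the TPR gap and the FPR gap between groups 0 and 1.

    Assumes every group contains both positive and negative true labels.
    """
    gaps = []
    for label in (1, 0):  # label 1 yields TPR, label 0 yields FPR
        rates = []
        for g in (0, 1):
            preds = [p for t, p, a in zip(y_true, y_pred, group)
                     if a == g and t == label]
            rates.append(positive_rate(preds))
        gaps.append(abs(rates[0] - rates[1]))
    return max(gaps)


def metric_gap(per_group_scores):
    """max - min of a per-subgroup metric, e.g. F1 per issue type or MAE per project."""
    values = list(per_group_scores.values())
    return max(values) - min(values)


# Toy data, invented for illustration.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

dp = demographic_parity_diff(y_pred, group)
eo = equalized_odds_diff(y_true, y_pred, group)
f1_gap = metric_gap({"Bug": 0.82, "Feature": 0.74, "Task": 0.61})
print(f"DP diff = {dp:.2f}, EO diff = {eo:.2f}, F1 gap = {f1_gap:.2f}")
```

On this toy data group 1 receives positive predictions three times as often as group 0, so both differences come out far above the 0.05 bound listed for equalized odds.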

Bias Mitigation Strategies

Pre-processing (Data Level)

Resampling: Oversample underrepresented issue types or project contexts
Data augmentation: Back-translate non-English issues; paraphrase rare categories
Label cleaning: Audit and correct systematic labeling errors in TAWOS before training/evaluation
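As an illustration of the resampling strategy, the sketch below oversamples every group up to the size of the largest one by drawing with replacement. It uses only the standard library. The function name `oversample_to_max` and the `issue_type` key are assumptions made for the example, and the input is assumed to be non-empty.

```python
import random
from collections import Counter, defaultdict


def oversample_to_max(rows, key, seed=0):
    """Keep all original rows and add random duplicates of minority-group rows
    until every group matches the largest group's size."""
    rng = random.Random(seed)  # fixed seed keeps the resample reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Draw the missing rows with replacement; k is 0 for the largest group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


rows = (
    [{"issue_type": "Bug", "id": i} for i in range(4)]
    + [{"issue_type": "Task", "id": 100}]
)
balanced = oversample_to_max(rows, key="issue_type")
counts = Counter(r["issue_type"] for r in balanced)
print(counts)
```

Duplicating rows balances class frequencies but adds no new information, so it should only be applied to the training split, never to the evaluation data.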

In-processing (Model Level)

Adversarial debiasing: Train model to maximize task performance while being unable to predict sensitive attributes
Constrained optimization: Impose fairness constraints (e.g., equalized odds) during fine-tuning
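Both in-processing strategies can be written as optimization problems. The notation below is an assumption introduced for illustration and does not come from the note: $\theta$ are the task model's parameters, $\phi$ the adversary's parameters, $\mathcal{L}_{\text{task}}$ the task loss, $\mathcal{L}_{\text{adv}}$ the adversary's loss for predicting the sensitive attribute, $\lambda \ge 0$ a trade-off weight, and $\epsilon$ a fairness tolerance.

Adversarial debiasing, where the adversary minimizes its own loss and the task model is rewarded when the adversary fails:

$$\min_{\theta} \max_{\phi} \; \mathcal{L}_{\text{task}}(\theta) - \lambda \, \mathcal{L}_{\text{adv}}(\theta, \phi)$$

Constrained optimization with an equalized odds constraint:

$$\min_{\theta} \; \mathcal{L}_{\text{task}}(\theta) \quad \text{s.t.} \quad \lvert \mathrm{TPR}_{A=0}(\theta) - \mathrm{TPR}_{A=1}(\theta) \rvert \le \epsilon, \quad \lvert \mathrm{FPR}_{A=0}(\theta) - \mathrm{FPR}_{A=1}(\theta) \rvert \le \epsilon$$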

Post-processing (Output Level)

Threshold calibration: Adjust classification thresholds per subgroup to equalize recall
Confidence adjustment: Apply subgroup-specific scaling to model confidence scores
HITL gates: Require human review for predictions on underrepresented or low-confidence subgroups
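The post-processing steps above can be sketched as follows: choose a separate threshold per subgroup so that each reaches the same target recall, then route predictions that fall close to the threshold to a human reviewer. All names (`threshold_for_recall`, `per_group_thresholds`, `needs_human_review`), the record keys, and the `margin` value are hypothetical choices for this example.

```python
import math


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold t such that predicting positive when score >= t
    reaches at least target_recall on the positive examples."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive examples in this subgroup")
    # At least this many positives must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def per_group_thresholds(records, target_recall):
    """records: dicts with hypothetical keys "group", "score", "label"."""
    groups = {}
    for r in records:
        groups.setdefault(r["group"], []).append(r)
    return {
        g: threshold_for_recall(
            [r["score"] for r in rs], [r["label"] for r in rs], target_recall
        )
        for g, rs in groups.items()
    }


def needs_human_review(score, threshold, margin=0.1):
    """HITL gate: flag predictions whose score lies within margin of the threshold."""
    return abs(score - threshold) < margin


# Toy data: the model scores Spanish-language issues lower overall.
records = (
    [{"group": "en", "score": s, "label": 1} for s in (0.9, 0.8, 0.7, 0.6)]
    + [{"group": "en", "score": 0.3, "label": 0}]
    + [{"group": "es", "score": s, "label": 1} for s in (0.6, 0.5, 0.4, 0.2)]
    + [{"group": "es", "score": 0.1, "label": 0}]
)
thresholds = per_group_thresholds(records, target_recall=0.75)
print(thresholds)
```

A single global threshold of 0.7 would recover three of four positives in the `en` group but none in the `es` group. Per-group thresholds equalize recall, though they may raise false positives in the lower-scoring group, which is one reason to pair them with a review gate.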

PUMA Ethics Chapter Framing

Research Accountability

PUMA must report:

F1 scores broken down by issue type, priority level, and project source

Whether the model performs systematically better on certain issue types (e.g., Bugs vs. Tasks)

Whether story point estimation errors are evenly distributed or concentrated in specific project types

Limitations: TAWOS dataset selection bias, evaluation-to-deployment distribution shift

Related Notes

PN-HITL-BoundedAutonomy: HITL as bias mitigation
Ethics-Review-Log: PUMA ethics documentation
PN-Evaluation-Metrics-Comprehensive: subgroup F1 metrics
LN-Collaborating-AIAgents-2025: field evidence on AI impact on human teams

MOCs

MOC-PUMA-Master
