Quality Control
The QualityControl class provides comprehensive data quality assessment and outlier detection methods for aerosol
instrument data.
Overview
Quality control is essential for ensuring data reliability in aerosol measurements. The QualityControl class offers various statistical methods to identify and handle outliers, invalid measurements, and data quality issues.
Key Features
- QCFlagBuilder System - Declarative rule-based quality control
- Multiple Outlier Detection Methods - N-sigma, IQR, time-aware rolling methods
- Flexible Thresholds - Configurable parameters for different data types
- Time-Aware Processing - Considers temporal patterns in data
- Status Code Filtering - Handles instrument-specific error codes
- Data Completeness Checks - Validates temporal coverage requirements
QCFlagBuilder System
All instruments use the declarative QCFlagBuilder system for quality control. This system provides:
- Declarative Rules - Define QC rules as simple dataclass instances
- Consistent Processing - All instruments use `QC_Flag` internally for quality control
- Transparent Results - Failed rules are clearly listed in the flag
- Clean Output - Final output has invalid data set to NaN and the `QC_Flag` column removed
QCRule Dataclass
```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass
class QCRule:
    name: str                                       # Rule identifier (e.g., "Invalid BC")
    condition: Callable[[pd.DataFrame], pd.Series]  # Returns True where data fails
    description: str                                # Human-readable description
```
QCFlagBuilder Class
```python
class QCFlagBuilder:
    def __init__(self, rules: list[QCRule]):
        self.rules = rules

    def build(self, df: pd.DataFrame) -> pd.Series:
        """Build QC flag column from rules."""
        flags = pd.Series("Valid", index=df.index)
        for rule in self.rules:
            mask = rule.condition(df)
            for idx in df.index[mask]:
                current = flags.loc[idx]
                if current == "Valid":
                    flags.loc[idx] = rule.name
                else:
                    flags.loc[idx] = f"{current}, {rule.name}"
        return flags
```
Usage Example
```python
from AeroViz.rawDataReader.core.qc import QCRule, QCFlagBuilder

# Define QC rules
rules = [
    QCRule(
        name="Invalid Range",
        condition=lambda df: (df['value'] < 0) | (df['value'] > 1000),
        description="Value outside valid range (0-1000)"
    ),
    QCRule(
        name="Missing Data",
        condition=lambda df: df['value'].isna(),
        description="Value is missing"
    ),
]

# Build flags
builder = QCFlagBuilder(rules)
df['QC_Flag'] = builder.build(df)

# Results:
# - "Valid" if all rules pass
# - "Invalid Range" if only the range check fails
# - "Invalid Range, Missing Data" if both fail
```
Instrument QC Rules Summary
| Instrument | QC Rules |
|---|---|
| AE33/AE43 | Status Error, Invalid BC, Invalid AAE, Insufficient |
| BC1054 | Status Error, Invalid BC, Invalid AAE, Insufficient |
| MA350 | Status Error, Invalid BC, Invalid AAE, Insufficient |
| SMPS | Status Error, Insufficient, Low Total, High Bin, High Large Bin |
| APS | Status Error, Insufficient, Low Total, High Total |
| NEPH/Aurora | No Data, Invalid Scat Value, Invalid Scat Rel, Insufficient |
| TEOM | High Noise, Negative/Zero, NV > Total, Invalid Vol Frac, Std Outlier, Insufficient |
| BAM1020 | Invalid Range, IQR Outlier |
| OCEC | Invalid Range, Below MDL, IQR Outlier, Missing OC |
| IGAC | Mass Closure, Missing Main, Below MDL, Ion Balance |
| EPA | Negative Value |
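Each reader declares rules like these as QCRule instances. As a purely hypothetical sketch of how the AE33 rules in the table could look (the column names and thresholds below are illustrative assumptions, not the values the AE33 reader actually uses):

```python
from AeroViz.rawDataReader.core.qc import QCRule

# Hypothetical AE33-style rule set; 'Status', 'BC6', and 'AAE' are assumed column names
ae33_rules = [
    QCRule(
        name="Status Error",
        condition=lambda df: df['Status'] != 0,
        description="Instrument reported a non-zero status code"
    ),
    QCRule(
        name="Invalid BC",
        condition=lambda df: (df['BC6'] <= 0) | (df['BC6'] > 20000),
        description="BC concentration outside a plausible range"
    ),
    QCRule(
        name="Invalid AAE",
        condition=lambda df: (df['AAE'] < 0.5) | (df['AAE'] > 3.0),
        description="Absorption Angstrom exponent outside a plausible range"
    ),
]
```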
Methods
Statistical Outlier Detection
N-Sigma Method
```python
from AeroViz.rawDataReader.core.qc import QualityControl

qc = QualityControl()
cleaned_data = qc.n_sigma(df, std_range=3)
```
Interquartile Range (IQR) Method
```python
# Basic IQR
cleaned_data = qc.iqr(df)

# With log transformation
cleaned_data = qc.iqr(df, log_dist=True)
```
Time-Aware Rolling IQR
```python
# Rolling IQR with time awareness
cleaned_data = qc.time_aware_rolling_iqr(
    df,
    window_size='24h',
    iqr_factor=3.0,
    min_periods=5
)
```
Advanced Quality Control
Bidirectional Trend Analysis
```python
# Detect outliers while considering data trends
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='30min',
    trend_factor=2.0
)
```
Status Code Filtering
```python
# Filter based on instrument status codes
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16],
    special_codes=[384, 1024]
)
```
Completeness Validation
```python
# Check hourly data completeness
completeness_mask = qc.hourly_completeness_QC(
    df,
    freq='6min',
    threshold=0.75  # Require 75% data availability
)
```
Usage Examples
Basic Outlier Removal
```python
from AeroViz.rawDataReader.core.qc import QualityControl
import pandas as pd

# Load your data
df = pd.read_csv('instrument_data.csv', index_col=0, parse_dates=True)

# Initialize QC
qc = QualityControl()

# Apply basic outlier detection
cleaned_df = qc.n_sigma(df, std_range=3)
```
Advanced Time-Aware Processing
```python
import numpy as np

# For time series with trends
cleaned_df = qc.time_aware_rolling_iqr(
    df,
    window_size='12h',
    iqr_factor=2.5,
    min_periods=10
)

# With bidirectional trend consideration
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='1h'
)

# Apply the mask
final_df = df.where(~outlier_mask, np.nan)
```
Instrument-Specific QC
```python
import numpy as np

# For instruments with status codes (e.g., AE33)
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16, 32],    # Common error codes
    special_codes=[384, 1024, 2048]   # Specific error conditions
)

# Remove error data
qc_df = df.where(~error_mask, np.nan)
```
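The individual checks can also be chained into a small pipeline. The sketch below combines status filtering, time-aware outlier removal, and a completeness check; the order and parameter values are illustrative, not a prescribed recipe:

```python
import numpy as np

# 1. Mask rows with instrument error status codes
error_mask = qc.filter_error_status(df, error_codes=[1, 2, 4, 16])
df_qc = df.where(~error_mask, np.nan)

# 2. Remove statistical outliers with a time-aware rolling IQR
df_qc = qc.time_aware_rolling_iqr(df_qc, window_size='24h', iqr_factor=3.0)

# 3. Blank out hours with insufficient temporal coverage
insufficient = qc.hourly_completeness_QC(df_qc, freq='6min', threshold=0.75)
df_qc = df_qc.where(~insufficient, np.nan)
```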
API Reference
AeroViz.rawDataReader.core.qc.QualityControl
A class providing various methods for data quality control and outlier detection
Functions
_ensure_dataframe
staticmethod
Ensure input data is in DataFrame format
_transform_if_log
staticmethod
Transform data to log scale if required
n_sigma
classmethod
Detect outliers using n-sigma method
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| std_range | int | Number of standard deviations to use as threshold | 5 |

Returns:

| Type | Description |
|---|---|
| DataFrame | Cleaned DataFrame with outliers masked as NaN |
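Conceptually, the n-sigma rule keeps values within std_range standard deviations of each column's mean. A rough pandas equivalent of that standard criterion (a sketch, not the library's actual implementation):

```python
import numpy as np
import pandas as pd

def n_sigma_sketch(df: pd.DataFrame, std_range: int = 5) -> pd.DataFrame:
    # Flag values farther than std_range standard deviations from each column's mean
    outliers = (df - df.mean()).abs() > std_range * df.std()
    return df.where(~outliers, np.nan)
```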
iqr
classmethod
Detect outliers using Interquartile Range (IQR) method
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| log_dist | bool | Whether to apply log transformation to data | False |

Returns:

| Type | Description |
|---|---|
| DataFrame | Cleaned DataFrame with outliers masked as NaN |
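The IQR rule flags values outside the quartile fences Q1 - k·IQR and Q3 + k·IQR, optionally after a log transform. A rough sketch of the standard technique (the factor and log handling here are assumptions, not the library's exact values):

```python
import numpy as np
import pandas as pd

def iqr_sketch(df: pd.DataFrame, log_dist: bool = False, factor: float = 1.5) -> pd.DataFrame:
    # Optionally evaluate the fences in log space (common for right-skewed aerosol data)
    data = np.log10(df.where(df > 0)) if log_dist else df
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    fence = factor * (q3 - q1)
    outliers = (data < q1 - fence) | (data > q3 + fence)
    return df.where(~outliers, np.nan)
```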
time_aware_rolling_iqr
classmethod
time_aware_rolling_iqr(df: DataFrame, window_size: str = '24h', log_dist: bool = False, iqr_factor: float = 5, min_periods: int = 5) -> DataFrame
Detect outliers using rolling time-aware IQR method with handling for initial periods
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| window_size | str | Size of the rolling window | '24h' |
| log_dist | bool | Whether to apply log transformation to data | False |
| iqr_factor | float | The factor by which to multiply the IQR | 5 |
| min_periods | int | Minimum number of observations required in window | 5 |

Returns:

| Type | Description |
|---|---|
| DataFrame | Cleaned DataFrame with outliers masked as NaN |
time_aware_std_QC
time_aware_std_QC(df: DataFrame, time_window: str = '6h', std_factor: float = 3.0, min_periods: int = 4) -> DataFrame
Time-aware outlier detection using rolling standard deviation
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| time_window | str | Rolling window size | '6h' |
| std_factor | float | Standard deviation multiplier (e.g., 3 means 3σ) | 3.0 |
| min_periods | int | Minimum number of observations required in window | 4 |

Returns:

| Type | Description |
|---|---|
| DataFrame | Quality controlled DataFrame with outliers marked as NaN |
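A usage sketch based on the signature above (the parameter values are illustrative):

```python
from AeroViz.rawDataReader.core.qc import QualityControl

qc = QualityControl()

# Mask points that deviate by more than 3σ within a rolling 6-hour window
cleaned_df = qc.time_aware_std_QC(df, time_window='6h', std_factor=3.0, min_periods=4)
```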
bidirectional_trend_std_QC
classmethod
bidirectional_trend_std_QC(df: DataFrame, window_size: str = '6h', std_factor: float = 3.0, trend_window: str = '30min', trend_factor: float = 2, min_periods: int = 4) -> Series
Perform quality control using standard deviation with awareness of both upward and downward trends.
This method identifies outliers considering both upward and downward trends in the data, applying more lenient criteria when consistent trends are detected.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data frame with time series (QC_Flag column is optional) | required |
| window_size | str | Size of the rolling window for std calculation | '6h' |
| std_factor | float | Base factor for standard deviation threshold | 3.0 |
| trend_window | str | Window for trend detection | '30min' |
| trend_factor | float | Factor to increase std_factor when trends are detected | 2 |
| min_periods | int | Minimum number of observations in window | 4 |

Returns:

| Type | Description |
|---|---|
| Series | Boolean mask where True indicates outliers |
filter_error_status
staticmethod
Filter data based on error status codes.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| _df | DataFrame | Input DataFrame | required |
| error_codes | list or array-like | Codes indicating errors | required |
| special_codes | list or array-like | Special codes to handle differently | None |
| return_mask | bool | If True, returns a boolean mask where True indicates errors; if False, returns the filtered DataFrame | True |

Returns:

| Type | Description |
|---|---|
| Union[DataFrame, Series] | If return_mask=True: boolean Series with True for error points. If return_mask=False: filtered DataFrame with error points masked |
spike_detection
classmethod
spike_detection(df: DataFrame, max_change_rate: float = 3.0, min_abs_change: float = None) -> Series
Vectorized spike detection using change rate analysis.
Detects sudden unreasonable value changes while allowing legitimate gradual changes during events (pollution episodes, etc.).
This method is much faster than rolling window methods because it uses pure numpy vectorized operations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data frame with time series | required |
| max_change_rate | float | Maximum allowed ratio of the current change to the median absolute change. Higher values are more permissive; 3.0 means a change must be 3x larger than the median change to be flagged | 3.0 |
| min_abs_change | float | Minimum absolute change required to be considered a spike. If None, uses 10% of the data's standard deviation | None |

Returns:

| Type | Description |
|---|---|
| Series | Boolean mask where True indicates detected spikes |
Notes
The algorithm:

1. Calculate the absolute difference between consecutive points
2. Calculate the median absolute change (a robust baseline)
3. Flag points where change > max_change_rate * median_change
4. Also detect "reversals" (spike up then immediately down)
This approach allows gradual changes during events while catching sudden spikes that are likely instrument errors.
Examples:
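A usage sketch based on the signature above (the parameter values are illustrative):

```python
from AeroViz.rawDataReader.core.qc import QualityControl
import numpy as np

qc = QualityControl()

# Flag points whose change from the previous sample exceeds 3x the median
# absolute change, then mask the detected spikes
spike_mask = qc.spike_detection(df, max_change_rate=3.0)
df_despiked = df.where(~spike_mask, np.nan)
```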
hourly_completeness_QC
classmethod
Check if each hour has sufficient data points.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data frame with time series | required |
| freq | str | Data frequency (e.g., '6min') | required |
| threshold | float | Minimum required proportion of data points per hour (0-1) | 0.5 |

Returns:

| Type | Description |
|---|---|
| Series | Boolean mask where True indicates insufficient data |
Related Documentation
- AbstractReader - Base class that uses QualityControl methods
- RawDataReader - Factory function with built-in QC options
- Instrument Documentation - Instrument-specific QC procedures