Quality Control
The QualityControl class provides comprehensive data quality assessment and outlier detection methods for aerosol
instrument data.
Overview
Quality control is essential for ensuring data reliability in aerosol measurements. The QualityControl class offers various statistical methods to identify and handle outliers, invalid measurements, and data quality issues.
Key Features
- QCFlagBuilder System - Declarative rule-based quality control
- Multiple Outlier Detection Methods - N-sigma, IQR, time-aware rolling methods
- Flexible Thresholds - Configurable parameters for different data types
- Time-Aware Processing - Considers temporal patterns in data
- Status Code Filtering - Handles instrument-specific error codes
- Data Completeness Checks - Validates temporal coverage requirements
QCFlagBuilder System
All instruments use the declarative QCFlagBuilder system for quality control. This system provides:
- Declarative Rules - Define QC rules as simple dataclass instances
- Consistent Processing - All instruments use QC_Flag internally for quality control
- Transparent Results - Failed rules are clearly listed in the flag
- Clean Output - Final output has invalid data set to NaN, QC_Flag column removed
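As a rough illustration of the "Clean Output" step, the sketch below shows one way flagged rows could be blanked and the flag column dropped. The helper name `apply_qc_flags` and the toy `BC` column are assumptions made for this example, not part of AeroViz.

```python
import numpy as np
import pandas as pd

# Toy frame with a QC_Flag column already built (values are illustrative).
df = pd.DataFrame(
    {
        "BC": [1.2, -5.0, 3.4],
        "QC_Flag": ["Valid", "Invalid BC", "Valid"],
    },
    index=pd.date_range("2024-01-01", periods=3, freq="h"),
)


def apply_qc_flags(frame: pd.DataFrame, flag_col: str = "QC_Flag") -> pd.DataFrame:
    """Hypothetical helper: set data to NaN where the flag is not 'Valid', then drop the flag."""
    data_cols = frame.columns.drop(flag_col)
    out = frame.copy()
    out.loc[out[flag_col] != "Valid", data_cols] = np.nan
    return out.drop(columns=flag_col)


clean = apply_qc_flags(df)
```

Working on a copy keeps the flagged frame available for inspection before the flag column is discarded.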
QCRule Dataclass
from dataclasses import dataclass
from typing import Callable
import pandas as pd
@dataclass
class QCRule:
    name: str  # Rule identifier (e.g., "Invalid BC")
    condition: Callable[[pd.DataFrame], pd.Series]  # Returns True where data fails
    description: str  # Human-readable description
QCFlagBuilder Class
class QCFlagBuilder:
    def __init__(self, rules: list[QCRule]):
        self.rules = rules

    def build(self, df: pd.DataFrame) -> pd.Series:
        """Build QC flag column from rules."""
        flags = pd.Series("Valid", index=df.index)
        for rule in self.rules:
            mask = rule.condition(df)
            for idx in df.index[mask]:
                current = flags.loc[idx]
                if current == "Valid":
                    flags.loc[idx] = rule.name
                else:
                    flags.loc[idx] = f"{current}, {rule.name}"
        return flags
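The row-by-row loop in build is easy to follow but performs one Python-level assignment per failing row. A vectorized variant is sketched below, assuming the output format stays the same (names of failing rules, comma-joined in rule order). The function `build_flags`, the local `QCRule` stand-in, the "High Value" rule, and the sample data are all made up for this illustration.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass
class QCRule:  # local stand-in mirroring the dataclass shown above
    name: str
    condition: Callable[[pd.DataFrame], pd.Series]
    description: str


def build_flags(df: pd.DataFrame, rules: list[QCRule]) -> pd.Series:
    """Vectorized sketch: join the names of all failing rules for each row."""
    flags = pd.Series("", index=df.index, dtype=object)
    for rule in rules:
        mask = rule.condition(df).astype(bool)
        # Prefix is "" for rows with no flag yet, "<existing>, " otherwise.
        prefix = flags.where(flags == "", flags + ", ")
        flags = flags.where(~mask, prefix + rule.name)
    return flags.replace("", "Valid")


df = pd.DataFrame({"value": [10.0, -1.0, float("nan"), 2000.0]})
rules = [
    QCRule("Invalid Range", lambda d: (d["value"] < 0) | (d["value"] > 1000), "Outside 0-1000"),
    QCRule("Missing Data", lambda d: d["value"].isna(), "Value is missing"),
    QCRule("High Value", lambda d: d["value"] > 1500, "Hypothetical extra rule"),
]
flags = build_flags(df, rules)
```

Here the last row fails two rules and ends up as "Invalid Range, High Value", while the NaN row only fails "Missing Data".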
Usage Example
from AeroViz.rawDataReader.core.qc import QCRule, QCFlagBuilder
# Define QC rules
rules = [
    QCRule(
        name="Invalid Range",
        condition=lambda df: (df['value'] < 0) | (df['value'] > 1000),
        description="Value outside valid range (0-1000)"
    ),
    QCRule(
        name="Missing Data",
        condition=lambda df: df['value'].isna(),
        description="Value is missing"
    ),
]
# Build flags
builder = QCFlagBuilder(rules)
df['QC_Flag'] = builder.build(df)
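# Results:
# - "Valid" if all rules pass
# - "Invalid Range" if only the range check fails
# - "Missing Data" if only the missing-data check fails
# - Comma-joined names in rule order (e.g. "Rule A, Rule B") when several rules
#   fail on the same row; the two rules above cannot both fail on one row,
#   because comparisons against NaN evaluate to False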
Instrument QC Rules Summary
| Instrument | QC Rules |
|---|---|
| AE33/AE43 | Status Error, Invalid BC, Invalid AAE, Insufficient |
| BC1054 | Status Error, Invalid BC, Invalid AAE, Insufficient |
| MA350 | Status Error, Invalid BC, Invalid AAE, Insufficient |
| SMPS | Status Error, Insufficient, Low Total, High Bin, High Large Bin |
| APS | Status Error, Insufficient, Low Total, High Total |
| NEPH/Aurora | No Data, Invalid Scat Value, Invalid Scat Rel, Insufficient |
| TEOM | High Noise, Negative/Zero, NV > Total, Invalid Vol Frac, Std Outlier, Insufficient |
| BAM1020 | Invalid Range, IQR Outlier |
| OCEC | Invalid Range, Below MDL, IQR Outlier, Missing OC |
| IGAC | Mass Closure, Missing Main, Below MDL, Ion Balance |
| EPA | Negative Value |
Methods
Statistical Outlier Detection
N-Sigma Method
from AeroViz.rawDataReader.core.qc import QualityControl
qc = QualityControl()
cleaned_data = qc.n_sigma(df, std_range=3)
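This page does not spell out how n_sigma computes its bounds. A common reading is mean ± std_range × standard deviation, evaluated per column; the sketch below follows that assumption. The function name `n_sigma_mask` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def n_sigma_mask(df: pd.DataFrame, std_range: float = 3) -> pd.DataFrame:
    """Assumed behaviour: mask values outside mean +/- std_range * std, per column."""
    mean, std = df.mean(), df.std()
    inside = (df >= mean - std_range * std) & (df <= mean + std_range * std)
    return df.where(inside)  # values outside the band become NaN


# Twenty ordinary readings and one extreme value.
df = pd.DataFrame({"conc": [10.0] * 20 + [500.0]})
cleaned = n_sigma_mask(df, std_range=3)
```

Because the extreme value also inflates the mean and standard deviation it is tested against, this kind of global check works best when outliers are rare.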
Interquartile Range (IQR) Method
# Basic IQR
cleaned_data = qc.iqr(df)
# With log transformation
cleaned_data = qc.iqr(df, log_dist=True)
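For orientation, a minimal IQR filter is sketched below. It assumes Tukey-style fences with a factor of 1.5 and a base-10 log when log_dist=True; neither detail is confirmed by this page, and `iqr_mask` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def iqr_mask(df: pd.DataFrame, factor: float = 1.5, log_dist: bool = False) -> pd.DataFrame:
    """Assumed behaviour: mask values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    # Non-positive values have no logarithm, so they are treated as missing.
    data = np.log10(df.where(df > 0)) if log_dist else df
    q1, q3 = data.quantile(0.25), data.quantile(0.75)
    iqr = q3 - q1
    inside = (data >= q1 - factor * iqr) & (data <= q3 + factor * iqr)
    return df.where(inside)


df = pd.DataFrame({"conc": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]})
cleaned = iqr_mask(df)
cleaned_log = iqr_mask(df, log_dist=True)
```

The log option is typically useful for skewed, strictly positive quantities, where fences on the raw scale would cut off legitimate high values.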
Time-Aware Rolling IQR
# Rolling IQR with time awareness
cleaned_data = qc.time_aware_rolling_iqr(
    df,
    window_size='24h',
    iqr_factor=3.0,
    min_periods=5
)
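A rolling variant can be sketched with pandas time-based windows. The version below is an assumption about the general idea, not the library's implementation: it uses a trailing window that includes the current point, and it simply leaves points unflagged until min_periods observations are available. `rolling_iqr_mask` and the synthetic series are hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_iqr_mask(s: pd.Series, window: str = "24h",
                     iqr_factor: float = 3.0, min_periods: int = 5) -> pd.Series:
    """Assumed behaviour: compare each point with IQR fences from a trailing time window."""
    roll = s.rolling(window, min_periods=min_periods)
    q1, q3 = roll.quantile(0.25), roll.quantile(0.75)
    iqr = q3 - q1
    # Comparisons against NaN fences are False, so early points are never flagged.
    outlier = (s < q1 - iqr_factor * iqr) | (s > q3 + iqr_factor * iqr)
    return s.mask(outlier)


index = pd.date_range("2024-01-01", periods=48, freq="h")
s = pd.Series(10.0 + (np.arange(48) % 5), index=index)
s.iloc[30] = 100.0  # injected spike
cleaned = rolling_iqr_mask(s, window="24h", iqr_factor=3.0, min_periods=5)
```

Because the fences follow the local distribution, a slow drift in the baseline moves the fences with it instead of being flagged.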
Advanced Quality Control
Bidirectional Trend Analysis
# Detect outliers while considering data trends
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='30min',
    trend_factor=2.0
)
Status Code Filtering
# Filter based on instrument status codes
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16],
    special_codes=[384, 1024]
)
filter_error_status supports four status_type modes: bitwise (AE33,
AE43, BC1054, MA350, TEOM), numeric (Aurora, NEPH), text (SMPS), and
binary_string (APS). The ignored_values whitelist lets you suppress
operator-known benign statuses without rewriting raw files. It is interpreted
in the active mode (string tokens, numeric codes, integer bits, or bit masks),
and entries that don't fit that mode are skipped. RawDataReader exposes it
as the ignored_status_errors parameter.
# bitwise: drop one warning bit from the error definition
# (e.g. ignore the TEOM "Dryer A" status bit 0x20000000)
error_mask = qc.filter_error_status(
    df,
    error_codes=[1 << b for b in range(32)],
    status_column='status',
    status_type='bitwise',
    ignored_values=[536870912],
)
# text: whitelist comma-split tokens (SMPS)
error_mask = qc.filter_error_status(
    df,
    status_column='Instrument Errors',
    status_type='text',
    ok_value='',
    ignored_values=['Low aerosol flow', 'Neutralizer not active'],
)
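To make the two whitelist examples above more concrete, the sketch below shows what a bitwise check and a token-based text check with an ignore list could look like. It is an assumption-based illustration, not the library's code: `bitwise_error_mask`, `text_error_mask`, and the status value "Some other error" are hypothetical.

```python
import pandas as pd


def bitwise_error_mask(status: pd.Series, error_codes, ignored_values=()) -> pd.Series:
    """True where any error bit is set, after clearing the ignored bits."""
    error_bits = 0
    for code in error_codes:
        error_bits |= int(code)
    for code in ignored_values:
        error_bits &= ~int(code)  # drop whitelisted bits from the error definition
    return (status.astype("int64") & error_bits) != 0


def text_error_mask(status: pd.Series, ok_value: str = "", ignored_values=()) -> pd.Series:
    """True where a comma-separated status holds a token that is neither OK nor whitelisted."""
    ignored = set(ignored_values)

    def has_error(cell) -> bool:
        tokens = [t.strip() for t in str(cell).split(",") if t.strip()]
        return any(t != ok_value and t not in ignored for t in tokens)

    return status.fillna("").map(has_error)


# Bit 0x20000000 (536870912) is whitelisted; any other set bit still counts as an error.
bit_status = pd.Series([0, 1, 0x20000000, 0x20000000 | 4])
bit_mask = bitwise_error_mask(bit_status, [1 << b for b in range(32)], ignored_values=[536870912])

text_status = pd.Series(["", "Low aerosol flow", "Low aerosol flow, Some other error"])
text_mask = text_error_mask(text_status, ok_value="", ignored_values=["Low aerosol flow"])
```

In both modes a row stays flagged when a non-whitelisted status accompanies a whitelisted one, which is the behaviour one would usually want from a benign-warning whitelist.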
Completeness Validation
# Check hourly data completeness
completeness_mask = qc.hourly_completeness_QC(
    df,
    freq='6min',
    threshold=0.75  # Require 75% data availability
)
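With freq='6min' an hour is expected to hold 10 points, so threshold=0.75 means at least 7.5 of them must be present. A minimal sketch of that arithmetic follows, assuming that only non-NaN values count as available and that every timestamp inherits the verdict of its hour; `hourly_completeness_mask` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def hourly_completeness_mask(s: pd.Series, freq: str = "6min",
                             threshold: float = 0.75) -> pd.Series:
    """Assumed behaviour: True for every timestamp in an hour with too few valid points."""
    expected = pd.Timedelta("1h") / pd.Timedelta(freq)  # 10.0 for 6-minute data
    hourly_count = s.resample("h").count()  # non-NaN points per hour
    insufficient = (hourly_count / expected) < threshold
    # Map each timestamp to the verdict of the hour it belongs to.
    per_point = insufficient.reindex(s.index.floor("h")).to_numpy()
    return pd.Series(per_point, index=s.index)


index = pd.date_range("2024-01-01 00:00", periods=20, freq="6min")
s = pd.Series(1.0, index=index)
s.iloc[12:16] = np.nan  # second hour keeps 6 of 10 points (60%)
mask = hourly_completeness_mask(s, freq="6min", threshold=0.75)
```

The first hour is complete and passes; the second falls below 75% and is flagged in full.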
Usage Examples
Basic Outlier Removal
from AeroViz.rawDataReader.core.qc import QualityControl
import numpy as np  # used by the masking snippets further down (np.nan)
import pandas as pd
# Load your data
df = pd.read_csv('instrument_data.csv', index_col=0, parse_dates=True)
# Initialize QC
qc = QualityControl()
# Apply basic outlier detection
cleaned_df = qc.n_sigma(df, std_range=3)
Advanced Time-Aware Processing
# For time series with trends
cleaned_df = qc.time_aware_rolling_iqr(
    df,
    window_size='12h',
    iqr_factor=2.5,
    min_periods=10
)
# With bidirectional trend consideration
outlier_mask = qc.bidirectional_trend_std_QC(
    df,
    window_size='6h',
    std_factor=3.0,
    trend_window='1h'
)
# Apply the mask
final_df = df.where(~outlier_mask, np.nan)
Instrument-Specific QC
# For instruments with status codes (e.g., AE33)
error_mask = qc.filter_error_status(
    df,
    error_codes=[1, 2, 4, 16, 32],  # Common error codes
    special_codes=[384, 1024, 2048]  # Specific error conditions
)
# Remove error data
qc_df = df.where(~error_mask, np.nan)
API Reference
AeroViz.rawDataReader.core.qc.QualityControl
A class providing various methods for data quality control and outlier detection
Functions
_ensure_dataframe
staticmethod
Ensure input data is in DataFrame format
_transform_if_log
staticmethod
Transform data to log scale if required
n_sigma
classmethod
Detect outliers using n-sigma method
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| std_range | int | Number of standard deviations to use as threshold | 5 |
Returns:
| Type | Description |
|---|---|
| DataFrame | Cleaned DataFrame with outliers masked as NaN |
iqr
classmethod
Detect outliers using Interquartile Range (IQR) method
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| log_dist | bool | Whether to apply log transformation to data | False |
Returns:
| Type | Description |
|---|---|
| DataFrame | Cleaned DataFrame with outliers masked as NaN |
time_aware_rolling_iqr
classmethod
time_aware_rolling_iqr(df: DataFrame, window_size: str = '24h', log_dist: bool = False, iqr_factor: float = 5, min_periods: int = 5) -> DataFrame
Detect outliers using rolling time-aware IQR method with handling for initial periods
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| window_size | str | Size of the rolling window | '24h' |
| log_dist | bool | Whether to apply log transformation to data | False |
| iqr_factor | float | The factor by which to multiply the IQR | 3 |
| min_periods | int | Minimum number of observations required in window | 4 |
Returns:
| Type | Description |
|---|---|
| DataFrame | Cleaned DataFrame with outliers masked as NaN |
time_aware_std_QC
time_aware_std_QC(df: DataFrame, time_window: str = '6h', std_factor: float = 3.0, min_periods: int = 4) -> DataFrame
Time-aware outlier detection using rolling standard deviation
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data | required |
| time_window | str | Rolling window size | '6h' |
| std_factor | float | Standard deviation multiplier (e.g., 3 means 3σ) | 3.0 |
| min_periods | int | Minimum number of observations required in window | 4 |
Returns:
| Type | Description |
|---|---|
| DataFrame | Quality controlled DataFrame with outliers marked as NaN |
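The parameters above suggest a rolling mean ± std_factor × rolling standard deviation test. A minimal sketch under that assumption is given below, using a trailing window that includes the point being tested; the real method may center or construct its window differently. `rolling_std_mask` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def rolling_std_mask(df: pd.DataFrame, time_window: str = "6h",
                     std_factor: float = 3.0, min_periods: int = 4) -> pd.DataFrame:
    """Assumed behaviour: mask points further than std_factor * rolling std from the rolling mean."""
    roll = df.rolling(time_window, min_periods=min_periods)
    mean, std = roll.mean(), roll.std()
    outlier = (df - mean).abs() > std_factor * std
    return df.mask(outlier)


# 10-minute data: a 6 h trailing window then holds 36 points.
index = pd.date_range("2024-01-01", periods=72, freq="10min")
df = pd.DataFrame({"conc": 10.0 + np.arange(72) % 5}, index=index)
df.iloc[50, 0] = 100.0  # injected spike
cleaned = rolling_std_mask(df, time_window="6h", std_factor=3.0, min_periods=4)
```

One caveat of this sketch: when the tested point is part of its own window, a single value among n points can be at most (n − 1)/√n sample standard deviations from the mean, so with std_factor=3 nothing can be flagged until the window holds at least 11 points.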
bidirectional_trend_std_QC
classmethod
bidirectional_trend_std_QC(df: DataFrame, window_size: str = '6h', std_factor: float = 3.0, trend_window: str = '30min', trend_factor: float = 2, min_periods: int = 4) -> Series
Perform quality control using standard deviation with awareness of both upward and downward trends.
This method identifies outliers considering both upward and downward trends in the data, applying more lenient criteria when consistent trends are detected.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data frame with time series (QC_Flag column is now optional) | required |
| window_size | str | Size of the rolling window for std calculation | '6h' |
| std_factor | float | Base factor for standard deviation threshold | 3.0 |
| trend_window | str | Window for trend detection | '30min' |
| trend_factor | float | Factor to increase std_factor when trends are detected | 2 |
| min_periods | int | Minimum number of observations in window | 4 |
Returns:
| Type | Description |
|---|---|
| Series | Boolean mask where True indicates outliers |
filter_error_status
staticmethod
filter_error_status(_df, error_codes=None, special_codes=None, return_mask=True, status_column='Status', status_type='bitwise', ok_value=None, ignored_values=None)
Filter data based on error status codes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| _df | DataFrame | Input DataFrame | required |
| error_codes | list or array-like | Codes indicating errors (for 'bitwise' type) | None |
| special_codes | list or array-like | Special codes to handle differently (exact match) | None |
| return_mask | bool | If True, returns a boolean mask where True indicates errors; if False, returns filtered DataFrame | True |
| status_column | str | Name of the status column in DataFrame | 'Status' |
| status_type | str | Type of status check. 'bitwise': use bitwise AND to check error codes (AE33, AE43, BC1054, MA350); 'numeric': check if status != ok_value (TEOM, Aurora, NEPH); 'text': check if status != ok_value as string (SMPS); 'binary_string': parse binary string and check if > 0 (APS) | 'bitwise' |
| ok_value | any | The value indicating OK status (for 'numeric', 'text' types). For 'numeric': typically 0. For 'text': typically 'Normal Scan' | None |
| ignored_values | list | Whitelist of statuses to suppress (treat as OK) without editing the raw files, e.g. an operator-known benign warning. Interpretation is mode-specific; entries that don't fit a mode are silently skipped so a whitelist meant for one instrument is harmless if it reaches another. Defaults to None (no whitelist; behaviour unchanged). | None |
Returns:
| Type | Description |
|---|---|
| Union[DataFrame, Series] | If return_mask=True: boolean Series with True for error points. If return_mask=False: filtered DataFrame with error points masked |
spike_detection
classmethod
spike_detection(df: DataFrame, max_change_rate: float = 3.0, min_abs_change: float = None) -> Series
Vectorized spike detection using change rate analysis.
Detects sudden unreasonable value changes while allowing legitimate gradual changes during events (pollution episodes, etc.).
This method is much faster than rolling window methods because it uses pure numpy vectorized operations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data frame with time series | required |
| max_change_rate | float | Maximum allowed ratio of current change to median absolute change. Higher values = more permissive. A value of 3.0 means a change must be 3x larger than the median change to be flagged. | 3.0 |
| min_abs_change | float | Minimum absolute change required to be considered a spike. If None, uses 10% of the data's standard deviation. | None |
Returns:
| Type | Description |
|---|---|
| Series | Boolean mask where True indicates detected spikes |
Notes
The algorithm:
1. Calculate absolute difference between consecutive points
2. Calculate the median absolute change (robust baseline)
3. Flag points where change > max_change_rate * median_change
4. Also detect "reversals" (spike up then immediately down)
This approach allows gradual changes during events while catching sudden spikes that are likely instrument errors.
Examples:
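The four steps in the Notes can be sketched with plain numpy as shown below. How the library combines steps 3 and 4 is not stated on this page; this illustration makes the assumption that a point is flagged only when a large jump is immediately undone by a large jump in the opposite direction, so a lasting level shift is left alone. `spike_mask` and the synthetic series are hypothetical.

```python
import numpy as np
import pandas as pd


def spike_mask(s: pd.Series, max_change_rate: float = 3.0,
               min_abs_change=None) -> pd.Series:
    """Sketch of change-rate spike detection (assumed combination of the documented steps)."""
    values = s.to_numpy(dtype=float)
    if min_abs_change is None:
        min_abs_change = 0.1 * np.nanstd(values)  # 10% of the standard deviation

    diff = np.diff(values, prepend=values[0])  # step 1: change from the previous point
    abs_diff = np.abs(diff)
    median_change = np.nanmedian(abs_diff[1:])  # step 2: robust baseline
    big = (abs_diff > max_change_rate * median_change) & (abs_diff > min_abs_change)  # step 3

    # Step 4: a reversal is a big jump followed directly by a big jump the other way.
    next_diff = np.append(diff[1:], 0.0)
    next_big = np.append(big[1:], False)
    reversal = big & next_big & (np.sign(diff) != np.sign(next_diff))
    return pd.Series(reversal, index=s.index)


index = pd.date_range("2024-01-01", periods=30, freq="6min")
s = pd.Series(10.0 + (np.arange(30) % 3), index=index)
s.iloc[15] = 60.0  # single-point spike
mask = spike_mask(s, max_change_rate=3.0)
```

Only the spike itself is flagged here; the return to baseline one step later is a large change too, but it is not followed by another reversal.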
hourly_completeness_QC
classmethod
Check if each hour has sufficient data points.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input data frame with time series | required |
| freq | str | Data frequency (e.g., '6min') | required |
| threshold | float | Minimum required proportion of data points per hour (0-1) | 0.5 |
Returns:
| Type | Description |
|---|---|
| Series | Boolean mask where True indicates insufficient data |
Related Documentation
- AbstractReader - Base class that uses QualityControl methods
- RawDataReader - Factory function with built-in QC options
- Instrument Documentation - Instrument-specific QC procedures