AbstractReader
The AbstractReader class is the foundation of AeroViz's data reading system, providing a standardized interface for
reading and processing aerosol instrument data.
Core Architecture
AbstractReader serves as the base class for all instrument-specific readers in AeroViz. It defines the common interface and provides shared functionality for data processing, quality control, and output formatting.
Overview
The AbstractReader implements a consistent workflow for all aerosol instruments:
- Data Ingestion - Read raw instrument files
- Format Detection - Automatically identify data structure
- Quality Control - Apply built-in validation and filtering
- Standardization - Convert to unified output format
- Metadata Handling - Preserve instrument and measurement metadata
Usage Pattern
AbstractReader cannot be instantiated on its own, but you can subclass it directly. More typically it is accessed through the RawDataReader factory function, which
automatically selects the appropriate reader based on your instrument type.
Key Features
- Flexible Input Handling - Supports various file formats and structures
- Built-in Quality Control - Configurable data validation and filtering
- Metadata Preservation - Maintains instrument configuration and measurement context
- Extensible Design - Easy to subclass for new instruments
- Error Handling - Robust error reporting and recovery
Implementation Note
AbstractReader is an abstract base class. For actual data reading, use instrument-specific implementations or the
RawDataReader factory function.
API Reference
AeroViz.rawDataReader.core.AbstractReader
Bases: ABC
Abstract class for reading raw data from different instruments.
This class serves as a base class for reading raw data from various instruments. Each instrument
should have a separate class that inherits from this class and implements the abstract methods.
The abstract methods are _raw_reader and _QC.
The class handles file management, including reading from and writing to pickle files, and implements quality control measures. It can process data in both batch and streaming modes.
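The contract described above (a shared workflow plus two abstract hooks) can be sketched with a small stand-in class. This is not the AeroViz implementation: `SketchReader`, `ToyReader`, `read`, and the `'Negative'` flag are hypothetical names, and rows are plain dicts rather than DataFrames so the block runs without dependencies.

```python
from abc import ABC, abstractmethod


class SketchReader(ABC):
    """Hypothetical stand-in for a reader base class (not AeroViz code)."""

    def read(self, files):
        # Template method: the shared workflow calls the two hooks in order.
        rows = []
        for file in files:
            rows.extend(self._raw_reader(file))
        return self._QC(rows)

    @abstractmethod
    def _raw_reader(self, file):
        """Parse one raw file into a list of row dicts."""

    @abstractmethod
    def _QC(self, rows):
        """Return the rows with a QC_Flag entry attached."""


class ToyReader(SketchReader):
    def _raw_reader(self, file):
        # A real reader would open `file`; here each "file" is already a list of values.
        return [{"value": v} for v in file]

    def _QC(self, rows):
        # Flag negative readings; everything else is marked 'Valid'.
        return [
            {**row, "QC_Flag": "Valid" if row["value"] >= 0 else "Negative"}
            for row in rows
        ]


result = ToyReader().read([[1.0, -2.0], [3.0]])
```

Because both hooks are declared with `abstractmethod`, Python refuses to instantiate the base class until a subclass supplies them, which is why an instrument-specific reader is always required.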
Attributes:
| Name | Type | Description |
|---|---|---|
| nam | str | Name identifier for the reader class |
| path | Path | Path to the raw data files |
| meta | dict | Metadata configuration for the instrument |
| logger | ReaderLogger | Custom logger instance for the reader |
| reset | bool | Flag to indicate whether to reset existing processed data |
| append | bool | Flag to indicate whether to append new data to existing processed data |
| qc | bool or str | Quality control settings |
| qc_freq | str or None | Frequency for quality control calculations |
Initialize the AbstractReader.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path or str | Path to the directory containing raw data files | required |
| reset | bool or str | If True, forces re-reading of raw data. If 'append', appends new data to existing processed data. | False |
| qc | bool or str | If True, performs quality control. If str, specifies the frequency for QC calculations. | True |
| **kwargs | dict | Additional keyword arguments, listed below | {} |

Additional keyword arguments:
- raw_freq (str) - Override raw data frequency (e.g., '6min', '1h'). If not set, frequency is auto-inferred from the data.
- drop_outlier_dates (bool, default=False) - Stray timestamps far outside the data's bulk (e.g. a year-2000 row in 2023 data) are always detected and warned about, since they balloon the native grid. By default they are kept (the warning explains how to fix the source); set True to drop them automatically before the grid is built.
- log_level (str) - Logging level for the log file
- quiet (bool) - If True, suppresses all console output
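The `reset` and `qc` arguments are each overloaded (a bool or a string), while the attribute list shows four separate settings. A minimal sketch of how such arguments could be split apart follows. `normalize_flags` is a hypothetical helper, not part of AeroViz, and it assumes that passing a frequency string for `qc` also enables quality control.

```python
def normalize_flags(reset=False, qc=True):
    """Hypothetical helper: split the two overloaded arguments into four settings."""
    append = reset == "append"      # the string 'append' selects append mode
    force_reset = reset is True     # only a literal True forces a re-read
    qc_enabled = bool(qc)           # assumption: a frequency string also enables QC
    qc_freq = qc if isinstance(qc, str) else None
    return {
        "reset": force_reset,
        "append": append,
        "qc": qc_enabled,
        "qc_freq": qc_freq,
    }


defaults = normalize_flags()
appending = normalize_flags(reset="append", qc="1MS")
```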
Notes
Creates necessary output directories and initializes logging system. Sets up paths for pickle files, CSV files, and report outputs.
Functions
_QC
abstractmethod
Abstract method for quality control processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame containing raw data | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | Quality controlled data with QC_Flag column |
Notes
Must be implemented by child classes to handle instrument-specific QC. This method should only check raw data quality (status, range, completeness). Derived parameter validation should be done in _process().
__call__
Process data for a specified time range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start | datetime | Start time for data processing; defaults to the data's first timestamp | None |
| end | datetime | End time for data processing; defaults to the data's last timestamp | None |
| mean_freq | str | Frequency for resampling the output; if None, no resampling is done and the data is returned at its native resolution | None |

Returns:
| Type | Description |
|---|---|
| DataFrame | Processed and resampled data for the specified time range |
Notes
The processed data is also saved to a CSV file.
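The documented defaults of `start`, `end`, and `mean_freq` can be illustrated with plain pandas. This is a sketch under assumptions, not the reader's own code: `select_and_resample` is a hypothetical helper and the readings are made-up values.

```python
import pandas as pd

# Six synthetic 10-minute readings (made-up values).
index = pd.date_range("2024-01-01 00:00", periods=6, freq="10min")
native = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)


def select_and_resample(df, start=None, end=None, mean_freq=None):
    """Hypothetical helper mirroring the documented defaults of __call__."""
    start = df.index.min() if start is None else start
    end = df.index.max() if end is None else end
    window = df.loc[start:end]
    if mean_freq is None:
        return window  # native resolution, no resampling
    return window.resample(mean_freq).mean()


half_hourly = select_and_resample(native, mean_freq="30min")
unchanged = select_and_resample(native)
```

With `mean_freq="30min"` the six rows collapse into two bin means; with the default `None` the frame comes back at its original 10-minute spacing.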
_cache_is_current
True if the cached frame was written by the current cache format.
_flag_outlier_dates
Detect and warn about stray timestamps; drop them only if asked.
A single bad row — e.g. a 2000-01-01 stamp in otherwise-2023 data —
stretches the canonical native grid (built over the data's own min->max
in _read_raw_files, before any requested range applies) across the
whole bogus span, inflating the cached frame to millions of NaN rows
even when the caller only asked for 2023. Such stamps are almost always
a source-data error, so by default we warn and tell the user how to fix
it rather than silently changing their data; pass
drop_outlier_dates=True to have them excluded automatically.
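The cost of a single stray stamp is easy to see with a little arithmetic. The sketch below uses only the standard library. `grid_rows` and `split_outliers` are hypothetical helpers, and the "more than a year from the median stamp" rule is an illustrative assumption rather than the detection rule AeroViz applies.

```python
from datetime import datetime, timedelta

# 48 hourly stamps in 2023 plus one stray year-2000 row.
stamps = [datetime(2023, 1, 1) + timedelta(hours=h) for h in range(48)]
stamps.append(datetime(2000, 1, 1))


def grid_rows(times, step):
    """Number of points on a regular grid spanning min..max of `times`."""
    return (max(times) - min(times)) // step + 1


def split_outliers(times, max_distance=timedelta(days=365)):
    """Hypothetical rule: a stamp is stray if it lies far from the median stamp."""
    centre = sorted(times)[len(times) // 2]  # a median is robust to a few strays
    kept = [t for t in times if abs(t - centre) <= max_distance]
    stray = [t for t in times if abs(t - centre) > max_distance]
    return kept, stray


step = timedelta(hours=1)
kept, stray = split_outliers(stamps)
rows_with_stray = grid_rows(stamps, step)  # grid stretched back to 2000
rows_without = grid_rows(kept, step)       # grid over the real coverage only
```

Here 48 real rows become a grid of more than two hundred thousand hourly slots once the stray stamp is included, which is the inflation the warning is meant to surface.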
_generate_report
Calculate and log data quality rates for different time periods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_data | DataFrame | Raw data before quality control | required |
| qc_data | DataFrame | Data after quality control | required |
| qc_flag | Series | QC flag series indicating validity of each row | None |
Notes
Calculates rates for specified QC frequency if set. Updates the quality report with calculated rates.
_load_or_parse
Return the canonical parsed (raw, qc) frames, using the pkl cache when it exists and is current.
Canonical = snapped to the native grid over the files' own coverage,
NOT padded to any requested range. Parse provenance (n_files,
raw_freq, freq_mixed) is persisted in df.attrs so a cache
hit restores it onto self. The requested range / fill_missing
is applied later, in _run.
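The cache-or-parse decision, with provenance carried in `df.attrs`, can be sketched as follows. The helper names, the `cache_format` key, and the `CACHE_FORMAT` constant are hypothetical. Only the `n_files` and `raw_freq` keys come from the description above, and the cache is held in memory as bytes rather than in a pkl file to keep the block self-contained.

```python
import pickle

import pandas as pd

CACHE_FORMAT = 2  # hypothetical version marker, bumped when the cache layout changes


def stamp_parse_meta(df, n_files, raw_freq):
    # Store provenance on the frame itself so it travels with the pickle.
    df.attrs.update(
        {"cache_format": CACHE_FORMAT, "n_files": n_files, "raw_freq": raw_freq}
    )
    return df


def cache_is_current(df):
    return df.attrs.get("cache_format") == CACHE_FORMAT


def load_or_parse(cached_bytes, parse):
    """Return (frame, cache_hit); fall back to parse() on a missing or stale cache."""
    if cached_bytes is not None:
        df = pickle.loads(cached_bytes)
        if cache_is_current(df):
            return df, True
    return parse(), False


def parse():
    df = pd.DataFrame({"value": [1.0, 2.0]})
    return stamp_parse_meta(df, n_files=1, raw_freq="1h")


fresh, hit_first = load_or_parse(None, parse)     # no cache yet: parse
blob = pickle.dumps(fresh)
cached, hit_second = load_or_parse(blob, parse)   # current cache: hit

stale = pd.DataFrame({"value": [0.0]})
stale.attrs["cache_format"] = 1                   # written by an older format
_, hit_stale = load_or_parse(pickle.dumps(stale), parse)
```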
_outlier_process
Process outliers in the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| _df | DataFrame | Input DataFrame containing potential outliers | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with outliers processed |
Notes
Implementation depends on specific instrument requirements.
_partition_compatible_scans
Drop frames whose scan schema differs from the dominant group.
Default is a no-op — overridden by readers (currently SMPS) where the
same instrument can export at different size-bin grids depending on
the host software version (AIM 10.3 .TXT vs AIM 11.x .CSV). The outer
join inside pd.concat happily concatenates frames with disjoint
columns, but the NaN holes break per-bin completeness QC. The
well-defined repair is to treat each grid as its own scan: keep the
majority group, drop the minority and tell the user which files were
skipped so they can re-run them in isolation if they want both.
df_list and files are aligned and contain only successfully
parsed entries.
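The NaN holes and the majority-group repair can be shown with three tiny frames. This is a sketch, not the SMPS reader's override: `partition_compatible` is a hypothetical function, the bin labels and file names are made up, and "schema" is taken to mean the exact tuple of column names.

```python
from collections import Counter

import pandas as pd


def partition_compatible(df_list, files):
    """Hypothetical sketch: keep frames sharing the most common column set."""
    schemas = [tuple(df.columns) for df in df_list]
    dominant, _ = Counter(schemas).most_common(1)[0]
    kept = [df for df, schema in zip(df_list, schemas) if schema == dominant]
    skipped = [f for f, schema in zip(files, schemas) if schema != dominant]
    return kept, skipped


a = pd.DataFrame({"11.8": [1.0], "12.2": [2.0]})
b = pd.DataFrame({"11.8": [3.0], "12.2": [4.0]})
c = pd.DataFrame({"11.5": [5.0], "12.0": [6.0]})  # exported on a different bin grid

# The default outer join keeps all four columns and pads the gaps with NaN.
naive = pd.concat([a, b, c])

kept, skipped = partition_compatible([a, b, c], ["a.txt", "b.txt", "c.csv"])
clean = pd.concat(kept)
```

The naive concatenation produces a 3 x 4 frame with six NaN cells, while the partitioned result has no holes and reports `c.csv` as skipped so it can be processed separately.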
_process
Process data to calculate derived parameters.
This method is called after _QC() to calculate instrument-specific derived parameters (e.g., absorption coefficients, AAE, SAE).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Quality-controlled DataFrame with QC_Flag column | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with derived parameters added and QC_Flag updated |
Notes
Default implementation returns the input unchanged. Override in child classes to implement instrument-specific processing.
The method should:
1. Skip calculation for rows where QC_Flag != 'Valid' (optional optimization)
2. Calculate derived parameters
3. Validate derived parameters and update QC_Flag if invalid
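The three steps expected of a `_process` override can be sketched on a toy frame. Everything instrument-specific here is an assumption: the `signal` and `reference` columns, the derived `ratio`, and the `'Status Error'` and `'Invalid Ratio'` flag names are made up for illustration, and `process` is a free function rather than a real reader method.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "signal": [10.0, 20.0, 5.0],
        "reference": [2.0, 0.0, 1.0],
        "QC_Flag": ["Valid", "Valid", "Status Error"],
    }
)


def process(df):
    out = df.copy()
    valid = out["QC_Flag"] == "Valid"

    # Steps 1-2: compute the derived column only where raw QC passed
    # and the calculation is defined (non-zero reference).
    out["ratio"] = float("nan")
    usable = valid & (out["reference"] != 0)
    out.loc[usable, "ratio"] = (
        out.loc[usable, "signal"] / out.loc[usable, "reference"]
    )

    # Step 3: a previously valid row without a usable derived value is re-flagged.
    out.loc[valid & out["ratio"].isna(), "QC_Flag"] = "Invalid Ratio"
    return out


processed = process(df)
```

Rows already rejected by `_QC` keep their original flag, so the derived-parameter check only ever downgrades rows that were still valid.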
_raw_reader
abstractmethod
Abstract method to read raw data files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file | Path or str | Path to the raw data file | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | Raw data read from the file |
Notes
Must be implemented by child classes to handle specific file formats.
_read_raw_files
Read and process raw data files.
Returns:
| Type | Description |
|---|---|
| tuple[DataFrame \| None, DataFrame \| None] | Tuple of (raw data DataFrame or None, quality-controlled DataFrame or None) |
Notes
Handles file reading and initial processing.
_restore_parse_meta
Pull parse provenance off a cached frame back onto self (cache hit).
_run
Main execution method for data processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| user_start | datetime | Start time for processing | required |
| user_end | datetime | End time for processing | required |

Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, DataFrame] | Raw and quality-controlled frames for the requested range. |
Notes
Two layers. _load_or_parse returns the canonical parsed frames
(from the pkl cache when valid, else by reading the raw files). The
presentation step below — grid placement to the requested range, with
fill_missing — runs on every call, so a cache hit honours the
current call's range/fill_missing instead of replaying whatever was
stored. Parse provenance restored by _load_or_parse feeds the
df.attrs stamp in __call__.
_save_data
Save processed data to files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| raw_data | DataFrame | Raw data to save | required |
| qc_data | DataFrame | Quality controlled data to save | required |
Notes
Saves data in both pickle and CSV formats.
_stamp
Attach reader metadata to df.attrs just before returning.
Always records provenance (instrument, station, coverage, requested
range, native frequency). When with_qc is True it additionally
records the output frequency and the overall QC rates; the plain raw
path (qc=False) gets provenance only.
See core.metadata for why this is the single, final stamping point.
_stamp_parse_meta
Persist parse provenance into df.attrs so it survives the pkl cache.
_timeIndex_process
Process time index of the DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| _df | DataFrame | Input DataFrame to process | required |
| user_start | datetime | User-specified start time | None |
| user_end | datetime | User-specified end time | None |
| append_df | DataFrame | DataFrame to append to | None |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with processed time index |
Notes
Frequency is resolved once per run in _read_raw_files (per-file
detection, see self._resolved_freq); this method only places the
data on that grid via to_grid — snapping off-grid timestamps to
their nearest bin without the duplicate-fill of method='nearest'.
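The difference between snapping and a nearest-fill reindex can be demonstrated with plain pandas. This sketch does not call the library's `to_grid`. It approximates the described behaviour by rounding the index and reindexing without a fill method, and the timestamps and values are made up.

```python
import pandas as pd

# Readings nominally every 6 minutes: one stamp is 20 s late and the 00:12 slot is missing.
index = pd.to_datetime(
    ["2024-01-01 00:00:00", "2024-01-01 00:06:20", "2024-01-01 00:18:00"]
)
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=index)
grid = pd.date_range("2024-01-01 00:00", "2024-01-01 00:18", freq="6min")

# reindex(method='nearest') fills every grid point, so the missing slot
# silently receives a copy of a neighbouring reading.
filled = df.reindex(grid, method="nearest")

# Snapping instead: round each stamp to its bin, keep one row per bin,
# then reindex without a fill method so true gaps stay NaN.
snapped = df.copy()
snapped.index = snapped.index.round("6min")
snapped = snapped[~snapped.index.duplicated(keep="first")].reindex(grid)
```

In `filled` the 00:12 slot holds a duplicated reading, whereas in `snapped` the late stamp lands on 00:06 and 00:12 remains an honest gap.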
progress_reading
Context manager for tracking file reading progress.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| files | list | List of files to process | required |

Yields:
| Type | Description |
|---|---|
| Progress | Progress bar object for tracking |
Notes
Uses rich library for progress display.
reorder_dataframe_columns
staticmethod
Reorder DataFrame columns according to specified lists.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Input DataFrame | required |
| order_lists | list[list] | Lists specifying column order | required |
| keep_others | bool | If True, keeps unspecified columns at the end | False |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with reordered columns |
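A minimal sketch of what such a reordering could look like, assuming the lists are applied in sequence and that names absent from the frame are skipped. `reorder_columns` is a hypothetical stand-in and may differ from the static method in edge cases.

```python
import pandas as pd


def reorder_columns(df, order_lists, keep_others=False):
    """Hypothetical sketch: flatten the lists and select columns in that order."""
    wanted = [col for group in order_lists for col in group]
    # dict.fromkeys removes repeats while keeping first-seen order.
    ordered = [col for col in dict.fromkeys(wanted) if col in df.columns]
    if keep_others:
        ordered += [col for col in df.columns if col not in ordered]
    return df[ordered]


df = pd.DataFrame({"c": [3], "a": [1], "extra": [9], "b": [2]})
strict = reorder_columns(df, [["a", "b"], ["c"]])
relaxed = reorder_columns(df, [["a", "b"], ["c"]], keep_others=True)
```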
update_qc_flag
staticmethod
Update QC_Flag column for rows matching the mask.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with QC_Flag column | required |
| mask | Series | Boolean mask indicating rows to flag | required |
| flag_name | str | Name of the flag to add | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with updated QC_Flag column |
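The mask-based flagging pattern can be sketched as below. How the real method combines a new flag with an existing one is not stated here, so the rule used (replace `'Valid'`, otherwise append with a comma) is an assumption, and `update_flag` plus the flag names are hypothetical.

```python
import pandas as pd


def update_flag(df, mask, flag_name):
    """Hypothetical sketch: flag masked rows, keeping any earlier flag."""
    out = df.copy()
    was_valid = out["QC_Flag"] == "Valid"
    # A valid row simply takes the new flag.
    out.loc[mask & was_valid, "QC_Flag"] = flag_name
    # An already flagged row keeps its flag and gains the new one.
    already = mask & ~was_valid
    out.loc[already, "QC_Flag"] = out.loc[already, "QC_Flag"] + ", " + flag_name
    return out


df = pd.DataFrame(
    {"value": [5.0, -1.0, -3.0], "QC_Flag": ["Valid", "Valid", "Status Error"]}
)
flagged = update_flag(df, df["value"] < 0, "Negative")
```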
Related Documentation
- RawDataReader Factory - High-level interface for instrument data reading
- Quality Control - Data validation and filtering options
- Supported Instruments - Available instrument implementations
Quick Example
from datetime import datetime

from AeroViz import RawDataReader
from AeroViz.rawDataReader.core import AbstractReader

# Using the factory function (recommended)
data = RawDataReader(
    instrument='AE33',
    path='/path/to/data',
    start=datetime(2024, 1, 1),
    end=datetime(2024, 12, 31),
)


# Direct usage (advanced - for custom implementations)
class MyInstrumentReader(AbstractReader):
    nam = 'MyInstrument'

    def _raw_reader(self, file):
        # Custom file reading logic: parse `file` and return a DataFrame
        raise NotImplementedError('parse the raw file into a DataFrame here')

    def _QC(self, df):
        # Custom QC logic: return the frame with a QC_Flag column
        return df