RawDataReader

Factory function for reading and processing instrument data in AeroViz.

Overview

RawDataReader is a factory function that provides a unified interface for reading and processing data from various scientific instruments. It automatically handles data loading, quality control, and time series processing.

Function Signature

AeroViz.rawDataReader.RawDataReader

RawDataReader(instrument: str, path: Path | str, reset: bool | str = False, qc: bool | str = True, start: datetime | str = None, end: datetime | str = None, mean_freq: str | None = None, size_range: tuple[float, float] | None = None, fill_missing: bool = True, ignored_status_errors: list[str] | None = None, output_dir: Path | str | None = None, output_prefix: str | None = None, save_pkl: bool = True, save_intermediate_csv: bool = True, save_report: bool = True, quiet: bool = False, log_level: Literal['DEBUG', 'INFO', 'WARNING', 'ERROR'] = 'INFO', **kwargs)

Factory function to instantiate the appropriate reader module for a given instrument and return the processed data over the specified time range.

Parameters:

Name	Type	Description	Default
`instrument`	`str`	The instrument name for which to read data, must be a valid key in the meta dictionary	required
`path`	`Path or str`	The directory where raw data files for the instrument are stored	required
`reset`	`bool or str`	Data processing control mode. `False` (default): use existing processed data if available. `True`: force reprocessing of all data from raw files. `'append'`: add new data to existing processed data.	`False`
`qc`	`bool or str`	Quality control and rate calculation mode. `True` (default): apply QC and calculate overall rates. `False`: skip QC and return raw data only. A frequency string: calculate rates at the specified interval, one of `'W'` (weekly), `'MS'` (month start), `'QS'` (quarter start) or `'YS'` (year start), optionally with a number prefix (e.g. `'2MS'` for every two months).	`True`
`start`	`datetime or str`	Start time for filtering the data. If omitted, starts at the first timestamp the files contain.	`None`
`end`	`datetime or str`	End time for filtering the data. If omitted, ends at the last timestamp the files contain. Omit both `start` and `end` to get full coverage; check `df.attrs['coverage_start']` and `df.attrs['coverage_end']` for what was found.	`None`
`mean_freq`	`str`	Resampling frequency for averaging the output (e.g. '1h', '30min', '1D'). If omitted, the data is returned at its native resolution — no resampling. Useful for already-aggregated / second-hand sources (e.g. EPA, IGAC, Minion, VOC, BAM1020).	`None`
`size_range`	`tuple[float, float]`	Size range in nanometers (min_size, max_size) for SMPS/APS data filtering	`None`
`append_stats`	`bool`	SMPS/APS only. The reader returns the dN/dlogDp distribution (diameters as columns). When True, the derived summary statistics (total / GMD / GSD / mode, per weighting and mode) are appended as extra columns of the returned frame. The default (False) keeps the return value a clean PSD matrix so it can be passed straight to `psd_stats` / `merge_psd` / `SizeDist`; the statistics are always also written to `{prefix}_stats.csv` alongside the `_dNdlogDp` / `_dSdlogDp` / `_dVdlogDp` distribution files.	`False`
`fill_missing`	`bool`	Time-grid coverage of the output. `True`: reindex/pad out to the full requested [start, end] range (historical behaviour; a short file can become a large, mostly-NaN frame). `False`: clamp the grid to the data's actual coverage, so the output never extends past what the files contain. Use `df.attrs` to compare the requested and actual ranges.	`True`
`ignored_status_errors`	`list`	Whitelist of statuses that should NOT be treated as a Status Error during QC, for operator-known benign warnings — without rewriting the raw files. Supported on every instrument with a status check; the whitelist is interpreted in that instrument's status mode (entries that don't fit are skipped, so the same call is safe across readers): - SMPS (text): comma-separated string tokens; a row is accepted when every token is the OK value or whitelisted, e.g. `ignored_status_errors=['Low aerosol flow', 'Neutralizer not active']`. - Aurora / NEPH (numeric): numeric status codes treated as OK, e.g. `[4, 16]`. - AE33 / AE43 / BC1054 / MA350 / TEOM (bitwise): integer error codes/bits to drop from the error definition, e.g. `[536870912]` to ignore the TEOM "Dryer A" status bit. - APS (binary_string): integer bit masks cleared before testing, e.g. `[1, 2]`.	`None`
`output_dir`	`Path or str`	Directory for all output files (pkl, csv, log, report). Default: `path/{instrument}_outputs/`	`None`
`output_prefix`	`str`	Prefix for output file names (e.g., `'NZ_smps'` → `NZ_smps.csv`). Default: `output_{instrument}`	`None`
`save_pkl`	`bool`	Whether to save pickle cache files. Existing pickles are still read when `reset=False` regardless of this setting.	`True`
`save_intermediate_csv`	`bool`	Whether to save intermediate `_read__qc.csv` / `_read__raw.csv` files.	`True`
`save_report`	`bool`	Whether to save `report.json`.	`True`
`quiet`	`bool`	Suppress all console output (progress bar, timeline, log messages). Log file is still written.	`False`
`log_level`	`Literal['DEBUG', 'INFO', 'WARNING', 'ERROR']`	Logging level for the log file	`'INFO'`
`**kwargs`		Additional arguments to pass to the reader module	`{}`
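
The whitelist modes described for `ignored_status_errors` can be pictured with a small standalone sketch. It is not AeroViz's implementation: the function names, the `error_bits` argument, the extra fault bit and the `'Normal Scan'` OK value are assumptions made for illustration. Only the TEOM "Dryer A" bit value and the SMPS token strings come from the table above.

```python
from __future__ import annotations


def bitwise_status_error(status: int, error_bits: int,
                         ignored: list[int] | None = None) -> bool:
    """Return True when `status` has an error bit set that is not whitelisted.

    Sketch of the 'bitwise' mode: whitelisted bits are dropped from the
    error definition before the status word is tested.
    """
    mask = error_bits
    for bit in ignored or []:
        mask &= ~bit  # remove the whitelisted bit from the error definition
    return (status & mask) != 0


def text_status_ok(status: str, ok_value: str,
                   ignored: list[str] | None = None) -> bool:
    """Return True when every comma-separated token is the OK value or whitelisted.

    Sketch of the 'text' mode; `ok_value` is a hypothetical placeholder.
    """
    allowed = {ok_value, *(ignored or [])}
    tokens = [t.strip() for t in status.split(',') if t.strip()]
    return all(t in allowed for t in tokens)


DRYER_A = 536870912  # TEOM "Dryer A" status bit (2 ** 29), from the table above
OTHER_FAULT = 1      # hypothetical second error bit
ALL_ERRORS = DRYER_A | OTHER_FAULT

# Without a whitelist the Dryer A bit counts as a status error.
flagged = bitwise_status_error(DRYER_A, ALL_ERRORS)
# With the whitelist it is ignored, while other faults are still caught.
ignored_ok = bitwise_status_error(DRYER_A, ALL_ERRORS, ignored=[DRYER_A])
still_flagged = bitwise_status_error(DRYER_A | OTHER_FAULT, ALL_ERRORS,
                                     ignored=[DRYER_A])

smps_ok = text_status_ok(
    'Normal Scan, Low aerosol flow',
    ok_value='Normal Scan',  # assumed OK token, not taken from AeroViz
    ignored=['Low aerosol flow', 'Neutralizer not active'],
)
```

In both modes the whitelist narrows what counts as an error rather than editing the status values themselves, which is why the raw files never need rewriting.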

Returns:

Type	Description
`DataFrame`	Processed data with specified QC and time range.

Reader metadata is attached to df.attrs (survives pickling and resample in pandas >= 2):

Always: instrument, station, source_path, n_files, coverage_start / coverage_end (the real file span, ignoring NaN padding), requested_start / requested_end, raw_freq, aeroviz_version, processed_at.
When qc is enabled, additionally: mean_freq, qc_applied, qc_freq, acquisition_rate, yield_rate, total_rate.

coverage_* is None when no data falls in the requested range.
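
Because `coverage_start` / `coverage_end` can be `None`, code that reports coverage should check for that case first. The helper below is a hypothetical sketch (the name `describe_coverage` is not part of AeroViz); it accepts any mapping with the keys listed above, so it runs here on plain dictionaries standing in for `df.attrs`.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any, Mapping


def describe_coverage(attrs: Mapping[str, Any]) -> str:
    """Summarise the real data span recorded in a df.attrs-like mapping."""
    start = attrs.get('coverage_start')
    end = attrs.get('coverage_end')
    if start is None or end is None:
        # Documented case: no data fell inside the requested range.
        return 'no data in the requested range'
    n_files = attrs.get('n_files', '?')
    return f'{n_files} file(s), data from {start} to {end}'


# Hypothetical metadata, shaped like the keys documented above.
with_data = {
    'n_files': 3,
    'coverage_start': datetime(2024, 2, 3),
    'coverage_end': datetime(2024, 2, 20, 12),
}
empty = {'n_files': 0, 'coverage_start': None, 'coverage_end': None}

summary = describe_coverage(with_data)
empty_summary = describe_coverage(empty)
```

With a real result the call would simply be `describe_coverage(df.attrs)`.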

Raises:

Type	Description
`ValueError`	If QC mode or mean_freq format is invalid
`TypeError`	If parameters are of incorrect type
`KeyError`	If instrument name is not found in the supported instruments list
`FileNotFoundError`	If path does not exist or cannot be accessed

Basic Usage

start, end, and mean_freq are all optional. Omit start/end to read the files' full coverage, and omit mean_freq to keep the data at its native resolution (no resampling — this is the default):

from pathlib import Path
from datetime import datetime
from AeroViz import RawDataReader

# Minimal — full coverage, native resolution
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
)

# Bounded range with hourly averaging
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
    start=datetime(2024, 2, 1),  # optional
    end=datetime(2024, 8, 31),   # optional
    mean_freq='1h'               # optional — resample to hourly means
)

Behaviour change

mean_freq no longer defaults to '1h'. The default is now no resampling (native resolution); pass mean_freq explicitly to average. start / end are also optional now.
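
Code that relied on the old hourly default can restore it in one place instead of editing every call. The sketch below uses `functools.partial`; `with_hourly_default` and `fake_reader` are hypothetical names, and `fake_reader` only stands in for `RawDataReader` so the example runs without AeroViz installed.

```python
from functools import partial


def with_hourly_default(reader):
    """Wrap a reader callable so that mean_freq defaults to '1h' again."""
    return partial(reader, mean_freq='1h')


def fake_reader(instrument, path, mean_freq=None, **kwargs):
    """Stand-in for RawDataReader: just echoes the arguments it received."""
    return {'instrument': instrument, 'path': path, 'mean_freq': mean_freq}


hourly_reader = with_hourly_default(fake_reader)

# mean_freq falls back to '1h' when the caller does not pass it ...
default_call = hourly_reader(instrument='AE33', path='/path/to/data')
# ... and an explicit value still wins over the wrapped default.
explicit_call = hourly_reader(instrument='AE33', path='/path/to/data',
                              mean_freq='30min')
```

In a real project the wrapper would be applied to the imported function, for example `read_hourly = with_hourly_default(RawDataReader)`.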

Result metadata (`df.attrs`)

Every result carries provenance and coverage metadata in df.attrs (the full list is under Returns above). With the default fill_missing=True the frame is padded to the requested range, so df.attrs['coverage_*'] is the quickest way to see what the files actually contained:

df.attrs['coverage_start']   # first row backed by real data
df.attrs['coverage_end']     # last row backed by real data (None if none in range)
df.attrs['requested_start']  # what you asked for (omitted when not given)
df.attrs['n_files']          # how many raw files were read
df.attrs['raw_freq']         # native resolution, auto-detected per file

Pass fill_missing=False to clamp the output grid to that coverage instead of padding it to the requested range.
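
The difference between padding and clamping can be illustrated on a bare list of timestamps. This is a standard-library sketch of the documented behaviour, not the reader's own code; `output_grid` and its arguments are hypothetical.

```python
from datetime import datetime, timedelta


def output_grid(requested, coverage, step, fill_missing=True):
    """Return the timestamps of the output grid.

    requested, coverage: (start, end) tuples; entries may be None.
    fill_missing=True  -> grid spans the requested range (padded),
                          falling back to coverage where a bound was omitted.
    fill_missing=False -> grid spans only the real data coverage (clamped).
    """
    if fill_missing:
        start = requested[0] if requested[0] is not None else coverage[0]
        end = requested[1] if requested[1] is not None else coverage[1]
    else:
        start, end = coverage
    if start is None or end is None:
        return []
    grid = []
    current = start
    while current <= end:
        grid.append(current)
        current += step
    return grid


requested = (datetime(2024, 2, 1, 0), datetime(2024, 2, 1, 5))
coverage = (datetime(2024, 2, 1, 2), datetime(2024, 2, 1, 3))  # real data span
hour = timedelta(hours=1)

padded = output_grid(requested, coverage, hour, fill_missing=True)
clamped = output_grid(requested, coverage, hour, fill_missing=False)
```

Here `padded` holds six hourly slots, four of which would carry no data, while `clamped` holds only the two slots backed by the files.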

More Examples

Scenario 1: Basic Usage with NEPH Instrument

neph_data = RawDataReader(
    instrument='NEPH',
    path=Path('/path/to/your/data/folder'),
    reset=True,
    start=datetime(2024, 2, 1),
    end=datetime(2024, 4, 30),
    mean_freq='1h'
)

Console Output:

╔════════════════════════════════════════════════════════════════════════════════╗
║     Reading NEPH RAW DATA from 2024-02-01 00:00:00 to 2024-04-30 23:59:59      ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading NEPH files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 file_name.dat
        ▶ Scatter Coe. (550 nm)
            ├─ Sample Rate    :   100.0%
            ├─ Valid  Rate    :   100.0%
            └─ Total  Rate    :   100.0%

Expected Output:

Hourly averaged NEPH data from 1 February to 30 April 2024.
Includes scattering coefficients and other NEPH-related metrics.

Scenario 2: AE33 with Quality Control and Rate Calculation

ae33_data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/your/data/folder'),
    reset=True,
    qc='1MS',  # calculate and print QC rates for each month
    start=datetime(2024, 1, 1),
    end=datetime(2024, 8, 31),
    mean_freq='1h',
)

Console Output:

╔════════════════════════════════════════════════════════════════════════════════╗
║     Reading AE33 RAW DATA from 2024-02-01 00:00:00 to 2024-05-31 23:59:59      ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading AE33 files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 AE33_AE33-S07-00599_20240225.dat
     AE33_AE33-S07-00599_20240704.dat may not be a whole daily data. Make sure the file is correct.  # warnings and errors
     AE33_AE33-S07-00599_20240711.dat may not be a whole daily data. Make sure the file is correct.  # are printed here
    ▶ Processing: 2024-02-01 to 2024-02-29
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :   26.3%
            ├─ Valid  Rate    :   99.5%
            └─ Total  Rate    :   26.1%
    ▶ Processing: 2024-03-01 to 2024-03-31
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :  100.0%
            ├─ Valid  Rate    :  100.0%
            └─ Total  Rate    :  100.0%
    ▶ Processing: 2024-04-01 to 2024-04-30
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :  100.0%
            ├─ Valid  Rate    :  100.0%
            └─ Total  Rate    :  100.0%
    ▶ Processing: 2024-05-01 to 2024-05-31
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :  100.0%
            ├─ Valid  Rate    :  100.0%
            └─ Total  Rate    :  100.0%

Expected Output:

Hourly AE33 data with quality control applied monthly.
Includes black carbon concentrations and absorption coefficients.
Will generate a CSV file with the processed data.
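
The per-month "Processing" blocks in the console output correspond to calendar-month windows. A rough sketch of how such windows can be derived with the standard library is shown below; `month_windows` is a hypothetical helper, and clamping the first and last window to the given dates is an assumption about how partial months are handled.

```python
from __future__ import annotations

from calendar import monthrange
from datetime import date


def month_windows(start: date, end: date) -> list[tuple[date, date]]:
    """Split [start, end] into calendar-month windows (inclusive dates)."""
    windows = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = date(year, month, 1)
        last = date(year, month, monthrange(year, month)[1])
        # Clamp partial months to the requested bounds (assumption).
        windows.append((max(first, start), min(last, end)))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return windows


windows = month_windows(date(2024, 2, 1), date(2024, 5, 31))
for first, last in windows:
    print(f'Processing: {first} to {last}')
```

For the range used here this yields four windows, February through May, with February ending on the 29th because 2024 is a leap year.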

Scenario 3: SMPS with Specific Time Range

smps_data = RawDataReader(
    instrument='SMPS',
    path=Path('/path/to/your/data/folder'),
    start=datetime(2024, 2, 1),
    end=datetime(2024, 8, 31),
    mean_freq='30min',
    size_range=(11.8, 593.5)  # user-specified size range in nm
)

Console Output:

╔════════════════════════════════════════════════════════════════════════════════╗
║     Reading SMPS RAW DATA from 2024-02-01 00:00:00 to 2024-08-31 23:59:59      ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading SMPS files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 240817.txt
    SMPS file: 240816.txt is not match the default size range (11.8, 593.5), it is (11.0, 593.5)  # reports files whose size range does not match
        ▶ Bins
            ├─ Sample Rate    :    1.7%
            ├─ Valid  Rate    :   93.3%
            └─ Total  Rate    :    1.6%

Expected Output:

SMPS data from 1 February to 31 August 2024.
30-minute averaged data points.
Includes particle size distribution information.

Advanced Features

Size Range Filtering

For size-resolved instruments (SMPS, APS, GRIMM):

data = RawDataReader(
    instrument="SMPS",
    path="data/",
    start="2024-01-01",
    end="2024-01-31",
    size_range=(10, 500)  # nm
)
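
Conceptually, `size_range` keeps only the diameter bins that fall between the two limits. The sketch below shows that idea on a plain list of bin diameters; `bins_in_range` is a hypothetical helper, the bin values are made up, and treating both limits as inclusive is an assumption rather than documented behaviour.

```python
def bins_in_range(diameters, size_range=None):
    """Return the bin diameters (nm) inside size_range, limits inclusive."""
    if size_range is None:
        return list(diameters)  # no filtering requested
    low, high = size_range
    if low > high:
        raise ValueError('size_range must be (min_size, max_size)')
    return [d for d in diameters if low <= d <= high]


# Hypothetical bin midpoints in nanometres.
bin_diameters = [9.47, 11.8, 50.5, 101.8, 593.5, 661.2]

kept = bins_in_range(bin_diameters, size_range=(10, 500))
unfiltered = bins_in_range(bin_diameters)
```

With these example values only the 11.8, 50.5 and 101.8 nm bins remain after filtering.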

Quality Control and Rate Calculation

data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
    reset=True,
    qc='1MS',  # Calculate and print QC rates monthly
    start=datetime(2024, 1, 1),
    end=datetime(2024, 12, 31),
)

Example console output:

▶ Processing: 2024-02-01 to 2024-02-29
    ▶ BC Mass Conc. (880 nm)
        ├─ Sample Rate    :   26.3%
        ├─ Valid  Rate    :   99.5%
        └─ Total  Rate    :   26.1%
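
One way to read the three rates is as ratios of record counts: how much of the expected data was acquired, how much of the acquired data passed QC, and the product of the two. The definitions in this sketch are an assumption, not taken from the AeroViz source, and the counts are hypothetical values chosen so that the rounded results line up with the February figures shown above.

```python
from __future__ import annotations


def qc_rates(expected: int, acquired: int, valid: int) -> dict[str, float]:
    """Compute sample, valid and total rates in percent (assumed definitions)."""
    if expected <= 0:
        raise ValueError('expected must be positive')
    sample_rate = acquired / expected * 100                   # acquired vs expected
    valid_rate = valid / acquired * 100 if acquired else 0.0  # passed QC vs acquired
    total_rate = valid / expected * 100                       # passed QC vs expected
    return {
        'sample': round(sample_rate, 1),
        'valid': round(valid_rate, 1),
        'total': round(total_rate, 1),
    }


# Hypothetical counts: 29 days x 24 hourly slots expected in February 2024.
rates = qc_rates(expected=696, acquired=183, valid=182)
for name, value in rates.items():
    print(f'{name.capitalize():<6} Rate : {value:6.1f}%')
```

Under these definitions the total rate equals the sample rate multiplied by the valid rate, which is consistent with the printed figures but should be checked against the library before being relied on.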

Output Files

After processing, the following files are generated in the {instrument}_outputs directory:

_read_{instrument}_raw.csv: Merged raw data with original time resolution
_read_{instrument}_raw.pkl: Raw data in pickle format
_read_{instrument}.csv: Quality controlled data
_read_{instrument}.pkl: QC data in pickle format
Output_{instrument}: Final processed data file
{instrument}.log: Processing log file
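
The defaults documented for `output_dir` and `output_prefix` can be summarised as a small path calculation. This is a sketch of the documented defaults only; `resolve_outputs` is a hypothetical helper, and `PurePosixPath` is used so the result is the same on every platform.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def resolve_outputs(path, instrument, output_dir=None, output_prefix=None):
    """Return (output directory, file-name prefix) using the documented defaults."""
    if output_dir is not None:
        out_dir = PurePosixPath(output_dir)
    else:
        # Documented default: path/{instrument}_outputs/
        out_dir = PurePosixPath(path) / f'{instrument}_outputs'
    # Documented default prefix: output_{instrument}
    prefix = output_prefix if output_prefix is not None else f'output_{instrument}'
    return out_dir, prefix


default_dir, default_prefix = resolve_outputs('/path/to/data', 'AE33')
custom_dir, custom_prefix = resolve_outputs(
    '/path/to/data', 'SMPS', output_dir='/results', output_prefix='NZ_smps'
)
custom_csv = custom_dir / f'{custom_prefix}.csv'
```

Setting both arguments explicitly is useful when several stations share one raw-data folder and their outputs must not overwrite each other.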

Supported Instruments

For detailed specifications of supported instruments, see Instruments API Reference.
