RawDataReader

Factory function for reading and processing instrument data in AeroViz.

Overview

RawDataReader is a factory function that provides a unified interface for reading and processing data from various scientific instruments. It automatically handles data loading, quality control, and time series processing.

Function Signature

AeroViz.rawDataReader.RawDataReader

RawDataReader(instrument: str, path: Path | str, reset: bool | str = False, qc: bool | str = True, start: datetime | str = None, end: datetime | str = None, mean_freq: str | None = None, size_range: tuple[float, float] | None = None, fill_missing: bool = True, ignored_status_errors: list[str] | None = None, output_dir: Path | str | None = None, output_prefix: str | None = None, save_pkl: bool = True, save_intermediate_csv: bool = True, save_report: bool = True, quiet: bool = False, log_level: Literal['DEBUG', 'INFO', 'WARNING', 'ERROR'] = 'INFO', **kwargs)

Factory function to instantiate the appropriate reader module for a given instrument and return the processed data over the specified time range.

Parameters:

Name Type Description Default
instrument str

The instrument name for which to read data, must be a valid key in the meta dictionary

required
path Path or str

The directory where raw data files for the instrument are stored

required
reset bool or str

Data processing control mode:
  • False (default): use existing processed data if available
  • True: force reprocessing of all data from the raw files
  • 'append': add new data to the existing processed data

False
qc bool or str

Quality control and rate calculation mode:
  • True (default): apply QC and calculate overall rates
  • False: skip QC and return raw data only
  • str: calculate rates at the specified interval: 'W' (weekly), 'MS' (month start), 'QS' (quarter start), 'YS' (year start). A number prefix is allowed (e.g. '2MS' for bi-monthly).

True
start datetime or str

Start time for filtering the data. If omitted, starts at the first timestamp the files contain.

None
end datetime or str

End time for filtering the data. If omitted, ends at the last timestamp the files contain. Omit both start and end to get full coverage; check df.attrs['coverage_start'/'coverage_end'] for what was found.

None
mean_freq str

Resampling frequency for averaging the output (e.g. '1h', '30min', '1D'). If omitted, the data is returned at its native resolution — no resampling. Useful for already-aggregated / second-hand sources (e.g. EPA, IGAC, Minion, VOC, BAM1020).

None
size_range tuple[float, float]

Size range in nanometers (min_size, max_size) for SMPS/APS data filtering

None
append_stats bool

SMPS/APS only. The reader returns the dN/dlogDp distribution (diameters as columns). When True, the derived summary statistics (total / GMD / GSD / mode, per weighting and mode) are appended as extra columns of the returned frame. The default (False) keeps the return value a clean PSD matrix so it can be passed straight to psd_stats / merge_psd / SizeDist; the statistics are always also written to {prefix}_stats.csv alongside the _dNdlogDp / _dSdlogDp / _dVdlogDp distribution files.

False
fill_missing bool

Time-grid coverage of the output:
  • True: reindex/pad out to the full requested [start, end] range (historical behaviour; a short file can become a large, mostly-NaN frame).
  • False: clamp the grid to the data's actual coverage, so the output never extends past what the files contain.
Use df.attrs for the requested-vs-actual range.

True
ignored_status_errors list

Whitelist of statuses that should NOT be treated as a Status Error during QC. Use it for operator-known benign warnings, without rewriting the raw files. Supported on every instrument with a status check; the whitelist is interpreted in that instrument's status mode (entries that don't fit are skipped, so the same call is safe across readers):
  • SMPS (text): comma-separated string tokens; a row is accepted when every token is the OK value or whitelisted, e.g. ignored_status_errors=['Low aerosol flow', 'Neutralizer not active'].
  • Aurora / NEPH (numeric): numeric status codes treated as OK, e.g. [4, 16].
  • AE33 / AE43 / BC1054 / MA350 / TEOM (bitwise): integer error codes/bits to drop from the error definition, e.g. [536870912] to ignore the TEOM "Dryer A" status bit.
  • APS (binary_string): integer bit masks cleared before testing, e.g. [1, 2].

None
output_dir Path or str

Directory for all output files (pkl, csv, log, report). Default: path/{instrument}_outputs/

None
output_prefix str

Prefix for output file names (e.g., 'NZ_smps' → NZ_smps.csv). Default: output_{instrument}

None
save_pkl bool

Whether to save pickle cache files. Existing pickles are still read when reset=False regardless of this setting.

True
save_intermediate_csv bool

Whether to save intermediate _read_*_qc.csv / _read_*_raw.csv files.

True
save_report bool

Whether to save report.json.

True
quiet bool

Suppress all console output (progress bar, timeline, log messages). Log file is still written.

False
log_level (DEBUG, INFO, WARNING, ERROR)

Logging level for the log file (default: 'INFO')

'INFO'
**kwargs

Additional arguments to pass to the reader module

{}
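The "bitwise" mode of ignored_status_errors described above can be pictured with a small sketch. This is not AeroViz's internal code: the helper has_status_error, the error mask, and the OTHER_FAULT bit are hypothetical and exist only to show what "dropping a bit from the error definition" could mean. Only the value 536870912 comes from the parameter description.

```python
from __future__ import annotations

# Hypothetical illustration; not taken from the AeroViz source.
DRYER_A = 536870912   # 2**29, the TEOM "Dryer A" bit quoted in the description
OTHER_FAULT = 1 << 3  # made-up error bit, used only for this demo


def has_status_error(status: int, error_mask: int,
                     ignored: list[int] | None = None) -> bool:
    """Return True if any non-ignored error bit is set in `status`."""
    for bit in ignored or []:
        error_mask &= ~bit  # drop whitelisted bits from the error definition
    return (status & error_mask) != 0


mask = DRYER_A | OTHER_FAULT
flagged_by_default = has_status_error(DRYER_A, mask)
flagged_when_ignored = has_status_error(DRYER_A, mask, ignored=[DRYER_A])
flagged_other_fault = has_status_error(DRYER_A | OTHER_FAULT, mask, ignored=[DRYER_A])
print(flagged_by_default, flagged_when_ignored, flagged_other_fault)
```

In this sketch a row carrying only the whitelisted bit passes, while a row that also carries another error bit is still flagged.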

Returns:

Type Description
DataFrame

Processed data with specified QC and time range.

Reader metadata is attached to df.attrs (survives pickling and resample in pandas >= 2):

  • Always: instrument, station, source_path, n_files, coverage_start / coverage_end (the real file span, ignoring NaN padding), requested_start / requested_end, raw_freq, aeroviz_version, processed_at.
  • When qc is enabled, additionally: mean_freq, qc_applied, qc_freq, acquisition_rate, yield_rate, total_rate.

coverage_* is None when no data falls in the requested range.

Raises:

Type Description
ValueError

If QC mode or mean_freq format is invalid

TypeError

If parameters are of incorrect type

KeyError

If instrument name is not found in the supported instruments list

FileNotFoundError

If path does not exist or cannot be accessed
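One way to handle the exceptions listed above is a thin wrapper around the call. The sketch below is an illustration only: read_or_none and fake_reader are hypothetical names, and fake_reader stands in for RawDataReader so the snippet runs without AeroViz or any data files.

```python
from typing import Any, Callable, Optional


def read_or_none(reader: Callable[..., Any], **kwargs: Any) -> Optional[Any]:
    """Call `reader(**kwargs)`; return None on the documented exceptions."""
    try:
        return reader(**kwargs)
    except KeyError as exc:
        print(f"Unknown instrument: {exc}")
    except FileNotFoundError as exc:
        print(f"Data path problem: {exc}")
    except (ValueError, TypeError) as exc:
        print(f"Bad argument: {exc}")
    return None


# Stand-in reader (hypothetical) that mimics the KeyError case.
def fake_reader(instrument: str, path: str) -> dict:
    if instrument != "AE33":
        raise KeyError(instrument)
    return {"instrument": instrument, "path": path}


missing = read_or_none(fake_reader, instrument="NOPE", path="data/")
found = read_or_none(fake_reader, instrument="AE33", path="data/")
```

With AeroViz installed, RawDataReader could be passed as the reader argument in place of fake_reader.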

See Also

AeroViz.rawDataReader.core.AbstractReader: An abstract reader class for reading raw data from different instruments

Examples:

>>> from AeroViz import RawDataReader
>>>
>>> # Using string inputs
>>> df_ae33 = RawDataReader(
...     instrument='AE33',
...     path='/path/to/your/data/folder',
...     reset=True,
...     qc='1MS',
...     start='2024-01-01',
...     end='2024-06-30',
...     mean_freq='1h',
... )
>>> # Using Path and datetime objects
>>> from pathlib import Path
>>> from datetime import datetime
>>>
>>> df_ae33 = RawDataReader(
...     instrument='AE33',
...     path=Path('/path/to/your/data/folder'),
...     reset=True,
...     qc='1MS',
...     start=datetime(2024, 1, 1),
...     end=datetime(2024, 6, 30),
...     mean_freq='1h',
... )

Basic Usage

start, end, and mean_freq are all optional. Omit start/end to read the files' full coverage, and omit mean_freq to keep the data at its native resolution (no resampling — this is the default):

from pathlib import Path
from datetime import datetime
from AeroViz import RawDataReader

# Minimal — full coverage, native resolution
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
)

# Bounded range with hourly averaging
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
    start=datetime(2024, 2, 1),  # optional
    end=datetime(2024, 8, 31),   # optional
    mean_freq='1h'               # optional — resample to hourly means
)

Behaviour change

mean_freq no longer defaults to '1h'. The default is now no resampling (native resolution); pass mean_freq explicitly to average. start / end are also optional now.
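For readers who relied on the old hourly default, the following sketch shows what hourly averaging of a native-resolution frame looks like in plain pandas. It uses synthetic data and the column name BC is invented; whether AeroViz averages in exactly this way internally is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute series standing in for native-resolution reader output.
index = pd.date_range("2024-02-01 00:00", periods=120, freq="1min")
native = pd.DataFrame({"BC": np.arange(120, dtype=float)}, index=index)

# Average onto an hourly grid, comparable in spirit to mean_freq='1h'.
hourly = native.resample("1h").mean()
print(hourly)
```

The two resulting rows hold the means of minutes 0 to 59 and 60 to 119.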

Result metadata (df.attrs)

Every result carries provenance and coverage metadata in df.attrs (full list in the function signature above). With the default fill_missing=True the frame is padded to the requested range, so df.attrs['coverage_*'] is the quickest way to see what the files actually contained:

df.attrs['coverage_start']   # first row backed by real data
df.attrs['coverage_end']     # last row backed by real data (None if none in range)
df.attrs['requested_start']  # what you asked for (omitted when not given)
df.attrs['n_files']          # how many raw files were read
df.attrs['raw_freq']         # native resolution, auto-detected per file

Pass fill_missing=False to clamp the output grid to that coverage instead of padding it to the requested range.
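The difference between padding and clamping can be illustrated with plain pandas on synthetic data. This is a sketch of the idea, not the reader's implementation; the frame and its value column are made up.

```python
import pandas as pd

# Synthetic hourly data that covers only three hours of the requested day.
data = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0]},
    index=pd.date_range("2024-02-01 10:00", periods=3, freq="1h"),
)
requested = pd.date_range("2024-02-01 00:00", "2024-02-01 23:00", freq="1h")

# Padding to the requested grid (the idea behind fill_missing=True).
padded = data.reindex(requested)

# Clamping to the span actually covered (the idea behind fill_missing=False).
clamped = padded.loc[data.index.min():data.index.max()]

print(len(padded), int(padded["value"].isna().sum()))  # 24 21
print(len(clamped))                                     # 3
```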

More Examples

Scenario 1: Basic Usage with NEPH Instrument

neph_data = RawDataReader(
    instrument='NEPH',
    path=Path('/path/to/your/data/folder'),
    reset=True,
    start=datetime(2024, 2, 1),
    end=datetime(2024, 4, 30),
    mean_freq='1h'
)

Console Output:

╔════════════════════════════════════════════════════════════════════════════════╗
║     Reading NEPH RAW DATA from 2024-02-01 00:00:00 to 2024-04-30 23:59:59      ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading NEPH files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 file_name.dat
        ▶ Scatter Coe. (550 nm)
            ├─ Sample Rate    :   100.0%
            ├─ Valid  Rate    :   100.0%
            └─ Total  Rate    :   100.0%

Expected Output:

  • Hourly averaged NEPH data for the requested range (February through April 2024).
  • Will include scattering coefficients and other NEPH-related metrics.

Scenario 2: AE33 with Quality Control and Rate Calculation

ae33_data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/your/data/folder'),
    reset=True,
    qc='1MS',  # report QC rates for each month
    start=datetime(2024, 2, 1),
    end=datetime(2024, 5, 31),
    mean_freq='1h',
)

Console Output:

╔════════════════════════════════════════════════════════════════════════════════╗
║     Reading AE33 RAW DATA from 2024-02-01 00:00:00 to 2024-05-31 23:59:59      ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading AE33 files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 AE33_AE33-S07-00599_20240225.dat
     AE33_AE33-S07-00599_20240704.dat may not be a whole daily data. Make sure the file is correct.  # warnings and errors
     AE33_AE33-S07-00599_20240711.dat may not be a whole daily data. Make sure the file is correct.  # are printed here
    ▶ Processing: 2024-02-01 to 2024-02-29
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :   26.3%
            ├─ Valid  Rate    :   99.5%
            └─ Total  Rate    :   26.1%
    ▶ Processing: 2024-03-01 to 2024-03-31
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :  100.0%
            ├─ Valid  Rate    :  100.0%
            └─ Total  Rate    :  100.0%
    ▶ Processing: 2024-04-01 to 2024-04-30
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :  100.0%
            ├─ Valid  Rate    :  100.0%
            └─ Total  Rate    :  100.0%
    ▶ Processing: 2024-05-01 to 2024-05-31
        ▶ BC Mass Conc. (880 nm)
            ├─ Sample Rate    :  100.0%
            ├─ Valid  Rate    :  100.0%
            └─ Total  Rate    :  100.0%

Expected Output:

  • Hourly AE33 data with quality control applied and QC rates reported per month.
  • Includes black carbon concentrations and absorption coefficients.
  • Will generate a CSV file with the processed data.
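The qc strings ('W', 'MS', 'QS', 'YS', with an optional number prefix) match pandas offset aliases. Assuming they are interpreted that way, the monthly windows in the console output above (2024-02-01 to 2024-05-31) can be reproduced with a short pandas sketch; the windowing logic here is an illustration, not the reader's own code.

```python
import pandas as pd

start, end = pd.Timestamp("2024-02-01"), pd.Timestamp("2024-05-31")

# 'MS' means month start; '2MS' would step two months at a time.
periods = []
for begin in pd.date_range(start, end, freq="MS"):
    # Last day of the window: one day before the next month start, capped at `end`.
    finish = min(begin + pd.offsets.MonthBegin(1) - pd.Timedelta(days=1), end)
    periods.append((str(begin.date()), str(finish.date())))

for begin, finish in periods:
    print(f"{begin} to {finish}")
```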

Scenario 3: SMPS with Specific Time Range

smps_data = RawDataReader(
    instrument='SMPS',
    path=Path('/path/to/your/data/folder'),
    start=datetime(2024, 2, 1),
    end=datetime(2024, 8, 31),
    mean_freq='30min',
    size_range=(11.8, 593.5)  # user-specified size range in nm
)

Console Output:

╔════════════════════════════════════════════════════════════════════════════════╗
║     Reading SMPS RAW DATA from 2024-02-01 00:00:00 to 2024-08-31 23:59:59      ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading SMPS files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 240817.txt
    SMPS file: 240816.txt is not match the default size range (11.8, 593.5), it is (11.0, 593.5)  # files with a mismatched size range are reported
        ▶ Bins
            ├─ Sample Rate    :    1.7%
            ├─ Valid  Rate    :   93.3%
            └─ Total  Rate    :    1.6%

Expected Output:

  • SMPS data for the requested range (February through August 2024).
  • 30-minute averaged data points.
  • Includes particle size distribution information.

Advanced Features

Size Range Filtering

For size-resolved instruments (SMPS, APS, GRIMM):

data = RawDataReader(
    instrument="SMPS",
    path="data/",
    start="2024-01-01",
    end="2024-01-31",
    size_range=(10, 500)  # nm
)
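Since the SMPS/APS readers return a distribution with diameters as columns, a size filter amounts to selecting columns. The sketch below shows that idea on a synthetic frame; filter_size_range is a hypothetical helper, and treating both bounds as inclusive is an assumption.

```python
import pandas as pd

# Synthetic PSD: one timestamp, columns are bin diameters in nm.
psd = pd.DataFrame(
    [[100.0, 200.0, 300.0, 50.0]],
    columns=[8.0, 11.8, 250.0, 593.5],
    index=pd.to_datetime(["2024-01-01 00:00"]),
)


def filter_size_range(df: pd.DataFrame, size_range: tuple) -> pd.DataFrame:
    """Keep only the columns whose diameter lies within [min_size, max_size]."""
    lo, hi = size_range
    keep = [d for d in df.columns if lo <= float(d) <= hi]
    return df[keep]


trimmed = filter_size_range(psd, (10, 500))
print(list(trimmed.columns))  # [11.8, 250.0]
```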

Quality Control and Rate Calculation

data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
    reset=True,
    qc='1MS',  # Calculate and print QC rates monthly
    start=datetime(2024, 1, 1),
    end=datetime(2024, 12, 31),
)

Example console output:

▶ Processing: 2024-02-01 to 2024-02-29
    ▶ BC Mass Conc. (880 nm)
        ├─ Sample Rate    :   26.3%
        ├─ Valid  Rate    :   99.5%
        └─ Total  Rate    :   26.1%
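The three rates are not defined on this page. One reading that is consistent with the numbers shown (26.3% × 99.5% ≈ 26.2%, close to the displayed 26.1%) is that the total rate is roughly the product of the sample and valid rates. The sketch below encodes that reading; the function name, its arguments, and the counts are all hypothetical.

```python
from __future__ import annotations


def coverage_rates(expected: int, observed: int, valid: int) -> dict[str, float]:
    """Assumed definitions, in percent:
    sample = observed / expected, valid = valid / observed, total = valid / expected.
    """
    sample = observed / expected * 100
    valid_rate = valid / observed * 100 if observed else 0.0
    total = valid / expected * 100
    return {
        "sample": round(sample, 1),
        "valid": round(valid_rate, 1),
        "total": round(total, 1),
    }


# Invented counts: 100 expected timestamps, 50 recorded, 45 passing QC.
rates = coverage_rates(expected=100, observed=50, valid=45)
print(rates)  # {'sample': 50.0, 'valid': 90.0, 'total': 45.0}
```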

Output Files

After processing, the following files are generated in the {instrument}_outputs directory:

  1. _read_{instrument}_raw.csv: Merged raw data with original time resolution
  2. _read_{instrument}_raw.pkl: Raw data in pickle format
  3. _read_{instrument}.csv: Quality controlled data
  4. _read_{instrument}.pkl: QC data in pickle format
  5. Output_{instrument}: Final processed data file
  6. {instrument}.log: Processing log file
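To inspect what was actually written, the default output directory can be built from the path and instrument name. The sketch assumes the documented default path/{instrument}_outputs/ and that the instrument name appears exactly as passed; default_output_dir is a hypothetical helper, not part of AeroViz.

```python
from pathlib import Path


def default_output_dir(path, instrument: str) -> Path:
    """Build the documented default output location: path/{instrument}_outputs."""
    return Path(path) / f"{instrument}_outputs"


out_dir = default_output_dir("/path/to/data", "AE33")

# List whatever the reader produced, if the directory exists yet.
if out_dir.is_dir():
    for file in sorted(out_dir.iterdir()):
        print(file.name)
```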

Supported Instruments

For detailed specifications of supported instruments, see Instruments API Reference.

See Also