RawDataReader
Factory function for reading and processing instrument data in AeroViz.
Overview
RawDataReader is a factory function that provides a unified interface for reading and processing data from various scientific instruments. It automatically handles data loading, quality control, and time series processing.
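The factory idea can be illustrated with a minimal, self-contained sketch. The class names, the `READERS` registry, and `read_instrument` below are hypothetical stand-ins, not AeroViz's actual internals; they only show how one entry point can dispatch to an instrument-specific reader.

```python
from pathlib import Path


class BaseReader:
    """Hypothetical stand-in for an instrument reader (not an AeroViz class)."""

    name = "BASE"

    def __init__(self, path):
        self.path = Path(path)

    def read(self) -> str:
        # A real reader would parse files, run QC, and return a DataFrame.
        return f"{self.name} data from {self.path}"


class AE33Reader(BaseReader):
    name = "AE33"


class NEPHReader(BaseReader):
    name = "NEPH"


# Registry mapping instrument names to reader classes (illustrative only).
READERS = {cls.name: cls for cls in (AE33Reader, NEPHReader)}


def read_instrument(instrument: str, path) -> str:
    """Look up the reader registered for `instrument` and return its result."""
    if instrument not in READERS:
        raise KeyError(f"Unsupported instrument: {instrument!r}")
    return READERS[instrument](path).read()


print(read_instrument("AE33", "data"))
```

The caller only names the instrument; which class does the work is decided by the registry lookup.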
Function Signature
AeroViz.rawDataReader.RawDataReader
RawDataReader(instrument: str, path: Path | str, reset: bool | str = False, qc: bool | str = True, start: datetime | str = None, end: datetime | str = None, mean_freq: str | None = None, size_range: tuple[float, float] | None = None, fill_missing: bool = True, ignored_status_errors: list[str] | None = None, output_dir: Path | str | None = None, output_prefix: str | None = None, save_pkl: bool = True, save_intermediate_csv: bool = True, save_report: bool = True, quiet: bool = False, log_level: Literal['DEBUG', 'INFO', 'WARNING', 'ERROR'] = 'INFO', **kwargs)
Factory function to instantiate the appropriate reader module for a given instrument and return the processed data over the specified time range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| instrument | str | The instrument name for which to read data; must be a valid key in the meta dictionary | required |
| path | Path or str | The directory where raw data files for the instrument are stored | required |
| reset | bool or str | Data processing control mode: False (default) uses existing processed data if available; True forces reprocessing of all data from raw files; 'append' adds new data to existing processed data | False |
| qc | bool or str | Quality control and rate calculation mode: True (default) applies QC and calculates overall rates; False skips QC and returns raw data only; a str calculates rates at the specified interval: 'W' (weekly), 'MS' (month start), 'QS' (quarter start), 'YS' (year start). A number prefix may be added (e.g., '2MS' for bi-monthly) | True |
| start | datetime or str | Start time for filtering the data. If omitted, starts at the first timestamp the files contain. | None |
| end | datetime or str | End time for filtering the data. If omitted, ends at the last timestamp the files contain. Omit both | None |
| mean_freq | str | Resampling frequency for averaging the output (e.g. '1h', '30min', '1D'). If omitted, the data is returned at its native resolution (no resampling). Useful for already-aggregated / second-hand sources (e.g. EPA, IGAC, Minion, VOC, BAM1020). | None |
| size_range | tuple[float, float] | Size range in nanometers (min_size, max_size) for SMPS/APS data filtering | None |
| append_stats | bool | SMPS/APS only. The reader returns the dN/dlogDp distribution (diameters as columns). When True, the derived summary statistics (total / GMD / GSD / mode, per weighting and mode) are appended as extra columns of the returned frame. The default (False) keeps the return value a clean PSD matrix so it can be passed straight to | False |
| fill_missing | bool | Time-grid coverage of the output: True reindexes/pads out to the full requested [start, end] range (historical behaviour; a short file can become a large mostly-NaN frame); False clamps the grid to the data's actual coverage, so the output never extends past what the files contain. Use | True |
| ignored_status_errors | list | Whitelist of statuses that should NOT be treated as a Status Error during QC, for operator-known benign warnings, without rewriting the raw files. Supported on every instrument with a status check; the whitelist is interpreted in that instrument's status mode (entries that don't fit are skipped, so the same call is safe across readers): - SMPS (text): comma-separated string tokens; a row is accepted when every token is the OK value or whitelisted, e.g. | None |
| output_dir | Path or str | Directory for all output files (pkl, csv, log, report). Default: | None |
| output_prefix | str | Prefix for output file names (e.g., | None |
| save_pkl | bool | Whether to save pickle cache files. Existing pickles are still read when | True |
| save_intermediate_csv | bool | Whether to save intermediate | True |
| save_report | bool | Whether to save | True |
| quiet | bool | Suppress all console output (progress bar, timeline, log messages). Log file is still written. | False |
| log_level | Literal['DEBUG', 'INFO', 'WARNING', 'ERROR'] | Logging level for the log file | 'INFO' |
| **kwargs | | Additional arguments to pass to the reader module | {} |
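The SMPS rule in the ignored_status_errors description (a row is accepted when every comma-separated token is the OK value or whitelisted) can be sketched as below. The function name, the `"OK"` value, and the `"Flow Low"` token are assumptions made for illustration; the real OK value and status strings depend on the instrument.

```python
def status_row_ok(status: str, ok_value: str, whitelist: set) -> bool:
    """Accept a row when every status token is the OK value or whitelisted.

    `status` is a comma-separated text field. An empty field has no tokens
    and is therefore accepted by this sketch; a real reader may differ.
    """
    tokens = [token.strip() for token in status.split(",") if token.strip()]
    return all(token == ok_value or token in whitelist for token in tokens)


# Hypothetical status strings, for illustration only.
print(status_row_ok("OK", "OK", set()))
print(status_row_ok("OK, Flow Low", "OK", set()))
print(status_row_ok("OK, Flow Low", "OK", {"Flow Low"}))
```

Under these assumptions, the second call rejects the row and the third accepts it because the extra token has been whitelisted.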
Returns:
| Type | Description |
|---|---|
| DataFrame | Processed data with specified QC and time range. Reader metadata is attached to |
Raises:
| Type | Description |
|---|---|
| ValueError | If QC mode or mean_freq format is invalid |
| TypeError | If parameters are of incorrect type |
| KeyError | If instrument name is not found in the supported instruments list |
| FileNotFoundError | If path does not exist or cannot be accessed |
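A caller may want to handle these four exception types separately. The helper below is a hypothetical wrapper, not part of AeroViz; it accepts any callable, so it can be exercised here with a stand-in reader instead of real data.

```python
def try_read(reader, **kwargs):
    """Call `reader(**kwargs)` and turn the documented errors into a message.

    Returns (result, None) on success or (None, message) on failure.
    """
    try:
        return reader(**kwargs), None
    except KeyError as exc:
        return None, f"unknown instrument: {exc}"
    except FileNotFoundError as exc:
        return None, f"data path not found: {exc}"
    except (ValueError, TypeError) as exc:
        return None, f"bad argument: {exc}"


def fake_reader(instrument, path):
    """Stand-in that always fails the way a missing directory would."""
    raise FileNotFoundError(path)


result, error = try_read(fake_reader, instrument="AE33", path="/no/such/dir")
print(result, error)
```

With AeroViz installed, the same wrapper could be called as `try_read(RawDataReader, instrument='AE33', path='/path/to/data')`.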
See Also
AeroViz.rawDataReader.core.AbstractReader: An abstract reader class for reading raw data from different instruments
Examples:
>>> from AeroViz import RawDataReader
>>>
>>> # Using string inputs
>>> df_ae33 = RawDataReader(
... instrument='AE33',
... path='/path/to/your/data/folder',
... reset=True,
... qc='1MS',
... start='2024-01-01',
... end='2024-06-30',
... mean_freq='1h',
... )
>>> # Using Path and datetime objects
>>> from pathlib import Path
>>> from datetime import datetime
>>>
>>> df_ae33 = RawDataReader(
... instrument='AE33',
... path=Path('/path/to/your/data/folder'),
... reset=True,
... qc='1MS',
... start=datetime(2024, 1, 1),
... end=datetime(2024, 6, 30),
... mean_freq='1h',
... )
Basic Usage
start, end, and mean_freq are all optional. Omit start/end to read the
files' full coverage, and omit mean_freq to keep the data at its native
resolution (no resampling — this is the default):
from pathlib import Path
from datetime import datetime

from AeroViz import RawDataReader

# Minimal: full coverage, native resolution
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
)

# Bounded range with hourly averaging
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
    start=datetime(2024, 2, 1),  # optional
    end=datetime(2024, 8, 31),   # optional
    mean_freq='1h',              # optional: resample to hourly means
)
Behaviour change
mean_freq no longer defaults to '1h'. The default is now no
resampling (native resolution); pass mean_freq explicitly to average.
start / end are also optional now.
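To make the effect of mean_freq='1h' concrete, here is a standard-library sketch of hourly averaging. It is not AeroViz's implementation; labelling each bin by the start of its hour is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def hourly_means(samples):
    """Average (timestamp, value) pairs into bins keyed by the hour start."""
    bins = defaultdict(list)
    for timestamp, value in samples:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        bins[hour].append(value)
    return {hour: fmean(values) for hour, values in sorted(bins.items())}


native = [
    (datetime(2024, 1, 1, 0, 10), 1.0),
    (datetime(2024, 1, 1, 0, 40), 3.0),
    (datetime(2024, 1, 1, 1, 5), 5.0),
]
hourly = hourly_means(native)
for hour, mean in hourly.items():
    print(hour, mean)
```

Leaving mean_freq unset corresponds to using `native` as it is, with no averaging step.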
Result metadata (df.attrs)
Every result carries provenance and coverage metadata in df.attrs (full list
in the function signature above). With the default fill_missing=True the frame
is padded to the requested range, so df.attrs['coverage_*'] is the quickest way
to see what the files actually contained:
df.attrs['coverage_start'] # first row backed by real data
df.attrs['coverage_end'] # last row backed by real data (None if none in range)
df.attrs['requested_start'] # what you asked for (omitted when not given)
df.attrs['n_files'] # how many raw files were read
df.attrs['raw_freq'] # native resolution, auto-detected per file
Pass fill_missing=False to clamp the output grid to that coverage instead of
padding it to the requested range.
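The difference between padding and clamping can be sketched with the standard library alone. `build_grid` is a hypothetical helper, and `None` stands in for the NaN padding a DataFrame would carry; the sketch assumes non-empty data whose timestamps already sit on the grid.

```python
from datetime import datetime, timedelta


def build_grid(data, start, end, step, fill_missing=True):
    """Return {timestamp: value or None} on a regular time grid.

    fill_missing=True pads to the requested [start, end] range;
    fill_missing=False clamps the grid to the span the data covers.
    """
    coverage_start, coverage_end = min(data), max(data)
    if not fill_missing:
        start, end = max(start, coverage_start), min(end, coverage_end)
    grid = {}
    current = start
    while current <= end:
        grid[current] = data.get(current)  # None marks a padded slot
        current += step
    return grid


data = {datetime(2024, 1, 1, 2): 1.0, datetime(2024, 1, 1, 3): 2.0}
requested = (datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 5))
hour = timedelta(hours=1)

padded = build_grid(data, *requested, hour, fill_missing=True)
clamped = build_grid(data, *requested, hour, fill_missing=False)
print(len(padded), len(clamped))
```

Here the padded grid spans the six requested hours, four of them empty, while the clamped grid keeps only the two hours the data covers.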
More Examples
Scenario 1: Basic Usage with NEPH Instrument
neph_data = RawDataReader(
    instrument='NEPH',
    path=Path('/path/to/your/data/folder'),
    reset=True,
    start=datetime(2024, 2, 1),
    end=datetime(2024, 4, 30),
    mean_freq='1h',
)
Console Output:
╔════════════════════════════════════════════════════════════════════════════════╗
║ Reading NEPH RAW DATA from 2024-02-01 00:00:00 to 2024-04-30 23:59:59 ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading NEPH files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 file_name.dat
▶ Scatter Coe. (550 nm)
├─ Sample Rate : 100.0%
├─ Valid Rate : 100.0%
└─ Total Rate : 100.0%
Expected Output:
- Hourly averaged NEPH data from 1 February to 30 April 2024.
- Will include scattering coefficients and other NEPH-related metrics.
Scenario 2: AE33 with Quality Control and Rate Calculation
ae33_data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/your/data/folder'),
    reset=True,
    qc='1MS',  # print QC rates for each month
    start=datetime(2024, 2, 1),
    end=datetime(2024, 5, 31),
    mean_freq='1h',
)
Console Output:
╔════════════════════════════════════════════════════════════════════════════════╗
║ Reading AE33 RAW DATA from 2024-02-01 00:00:00 to 2024-05-31 23:59:59 ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading AE33 files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 AE33_AE33-S07-00599_20240225.dat
AE33_AE33-S07-00599_20240704.dat may not be a whole daily data. Make sure the file is correct. # warnings or errors
AE33_AE33-S07-00599_20240711.dat may not be a whole daily data. Make sure the file is correct. # are printed here
▶ Processing: 2024-02-01 to 2024-02-29
▶ BC Mass Conc. (880 nm)
├─ Sample Rate : 26.3%
├─ Valid Rate : 99.5%
└─ Total Rate : 26.1%
▶ Processing: 2024-03-01 to 2024-03-31
▶ BC Mass Conc. (880 nm)
├─ Sample Rate : 100.0%
├─ Valid Rate : 100.0%
└─ Total Rate : 100.0%
▶ Processing: 2024-04-01 to 2024-04-30
▶ BC Mass Conc. (880 nm)
├─ Sample Rate : 100.0%
├─ Valid Rate : 100.0%
└─ Total Rate : 100.0%
▶ Processing: 2024-05-01 to 2024-05-31
▶ BC Mass Conc. (880 nm)
├─ Sample Rate : 100.0%
├─ Valid Rate : 100.0%
└─ Total Rate : 100.0%
Expected Output:
- Hourly AE33 data with quality control applied monthly.
- Includes black carbon concentrations and absorption coefficients.
- Will generate a CSV file with the processed data.
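The qc='1MS' value used in this scenario follows the pattern from the parameter table: an optional number prefix followed by 'W', 'MS', 'QS', or 'YS'. A minimal sketch of how such a value could be validated is shown below; `parse_qc` and its return shape are hypothetical, not AeroViz's parser.

```python
import re

# Optional integer prefix followed by one of the documented interval codes.
_QC_FREQ = re.compile(r"^(\d*)(W|MS|QS|YS)$")


def parse_qc(qc):
    """Return (apply_qc, multiple, unit) for a documented qc value."""
    if isinstance(qc, bool):
        return qc, None, None
    if not isinstance(qc, str):
        raise TypeError("qc must be a bool or str")
    match = _QC_FREQ.match(qc)
    if match is None:
        raise ValueError(f"invalid qc mode: {qc!r}")
    multiple = int(match.group(1) or 1)  # a bare 'MS' means every 1 month
    return True, multiple, match.group(2)


print(parse_qc("1MS"))
print(parse_qc("2MS"))
print(parse_qc(False))
```

The two error branches mirror the ValueError and TypeError cases listed under Raises.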
Scenario 3: SMPS with Specific Time Range
smps_data = RawDataReader(
    instrument='SMPS',
    path=Path('/path/to/your/data/folder'),
    start=datetime(2024, 2, 1),
    end=datetime(2024, 8, 31),
    mean_freq='30min',
    size_range=(11.8, 593.5),  # user-specified size range (nm)
)
Console Output:
╔════════════════════════════════════════════════════════════════════════════════╗
║ Reading SMPS RAW DATA from 2024-02-01 00:00:00 to 2024-08-31 23:59:59 ║
╚════════════════════════════════════════════════════════════════════════════════╝
▶ Reading SMPS files ━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00 240817.txt
SMPS file: 240816.txt is not match the default size range (11.8, 593.5), it is (11.0, 593.5) # reports the mismatched file
▶ Bins
├─ Sample Rate : 1.7%
├─ Valid Rate : 93.3%
└─ Total Rate : 1.6%
Expected Output:
- SMPS data from 1 February to 31 August 2024.
- 30-minute averaged data points.
- Includes particle size distribution information.
Advanced Features
Size Range Filtering
For size-resolved instruments (SMPS, APS, GRIMM):
data = RawDataReader(
    instrument="SMPS",
    path="data/",
    start="2024-01-01",
    end="2024-01-31",
    size_range=(10, 500),  # nm
)
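Conceptually, size_range keeps only the diameter bins that fall inside the given limits. The sketch below applies that to a single row of a size distribution stored as a plain dict; the helper name and the inclusive bounds are assumptions, and the values are made up for illustration.

```python
def filter_size_range(row, size_range):
    """Keep the diameter bins (nm) inside [min_size, max_size], inclusive."""
    min_size, max_size = size_range
    return {
        diameter: value
        for diameter, value in row.items()
        if min_size <= diameter <= max_size
    }


# One row of a size distribution: diameter in nm -> dN/dlogDp (made-up values).
row = {8.0: 120.0, 11.8: 340.0, 100.0: 910.0, 593.5: 15.0, 700.0: 2.0}
print(filter_size_range(row, (10, 500)))
```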
Quality Control and Rate Calculation
data = RawDataReader(
    instrument='AE33',
    path=Path('/path/to/data'),
    reset=True,
    qc='1MS',  # calculate and print QC rates monthly
    start=datetime(2024, 1, 1),
    end=datetime(2024, 12, 31),
)
Example console output:
▶ Processing: 2024-02-01 to 2024-02-29
▶ BC Mass Conc. (880 nm)
├─ Sample Rate : 26.3%
├─ Valid Rate : 99.5%
└─ Total Rate : 26.1%
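The three rates are not defined on this page. One reading consistent with the printed figures, where Total Rate is close to the product of the other two, is sketched below. The definitions in `qc_rates` are an assumption, not taken from the AeroViz source.

```python
def qc_rates(expected: int, present: int, valid: int) -> dict:
    """Compute percentage rates under assumed definitions.

    Sample Rate = records present / records expected in the period
    Valid Rate  = records passing QC / records present
    Total Rate  = records passing QC / records expected
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    return {
        "Sample Rate": round(100 * present / expected, 1),
        "Valid Rate": round(100 * valid / present, 1) if present else 0.0,
        "Total Rate": round(100 * valid / expected, 1),
    }


rates = qc_rates(expected=1000, present=500, valid=250)
for name, value in rates.items():
    print(f"{name} : {value}%")
```

Under these definitions a low Sample Rate caps the Total Rate even when nearly every recorded point passes QC, which is the pattern in the February output above.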
Output Files
After processing, the following files are generated in the {instrument}_outputs directory:
- _read_{instrument}_raw.csv: Merged raw data with original time resolution
- _read_{instrument}_raw.pkl: Raw data in pickle format
- _read_{instrument}.csv: Quality controlled data
- _read_{instrument}.pkl: QC data in pickle format
- Output_{instrument}: Final processed data file
- {instrument}.log: Processing log file
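To check which of these files exist after a run, the names can be generated from the templates above. `expected_outputs` is a hypothetical helper; where the {instrument}_outputs directory lives and the exact casing of the instrument name in file names are assumptions to verify against a real run.

```python
from pathlib import Path


def expected_outputs(parent, instrument: str) -> list:
    """Build the output paths named above under `parent`/{instrument}_outputs."""
    base = Path(parent) / f"{instrument}_outputs"
    names = [
        f"_read_{instrument}_raw.csv",
        f"_read_{instrument}_raw.pkl",
        f"_read_{instrument}.csv",
        f"_read_{instrument}.pkl",
        f"Output_{instrument}",
        f"{instrument}.log",
    ]
    return [base / name for name in names]


for path in expected_outputs("data", "AE33"):
    # Path.exists() reports whether the file was actually produced.
    print(path, path.exists())
```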
Supported Instruments
For detailed specifications of supported instruments, see Instruments API Reference.
See Also
- Base Class API - Documentation for the abstract base class
- Quality Control API - Details about quality control implementation