
ENH: Include line number and number of fields when read_csv() callable raises ParserWarning #61838

Open
@matthewgottlieb

Description


Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish I could use pandas to detect and repair issues in a CSV file, but raise an informative warning when an unrepairable issue is encountered.

I have written a function that identifies common issues (e.g. the field delimiter appearing unescaped inside a field) and checks the surrounding fields to estimate the original intent of the data. When this logic cannot identify the issue, the function returns the original line, and the user should be directed to the problematic line.

Feature Description

Given a CSV with bad lines (e.g. line 3 having an extra "E"):

id,field_1,field_2
101,A,B
102,C,D,E
103,F,G

read_csv() will, with all defaults (on_bad_lines='error'), raise a ParserError:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

With on_bad_lines='warn', it will emit a ParserWarning with the same helpful information:

<stdin>:1: ParserWarning: Skipping line 3: expected 3 fields, saw 4

However, when using a callable (e.g. on_bad_lines=line_fixer), the ParserWarning message is generic and indicates neither the line number nor the expected or seen field counts:

>>> import pandas as pd
>>> def line_fixer(line):
...     return [1, 2, 3, 4, 5]
...
>>> df = pd.read_csv('test.csv', engine='python', on_bad_lines=line_fixer)
<stdin>:1: ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.

Including these details would allow the user to find and fix the input CSV manually.
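
As a partial workaround, the callable can report the field counts itself. The following is a minimal sketch, not a proposed implementation: `make_line_fixer` and its "drop empty trailing fields" repair rule are hypothetical, and the expected field count is assumed to be known up front and captured in a closure. The callable only receives the split fields, so the line number still cannot be reported this way.

```python
import warnings


def make_line_fixer(expected_fields):
    """Build an on_bad_lines callable that reports what it could not repair.

    Hypothetical helper for illustration; not part of pandas.
    """

    def line_fixer(bad_line):
        # bad_line is the offending row, already split into a list of strings.
        repaired = list(bad_line)

        # Hypothetical repair rule: drop empty trailing fields.
        while len(repaired) > expected_fields and repaired[-1] == "":
            repaired.pop()
        if len(repaired) == expected_fields:
            return repaired

        # Not repairable: say how many fields were expected and seen.
        warnings.warn(
            f"Unrepairable line: expected {expected_fields} fields, "
            f"saw {len(bad_line)}: {bad_line!r}",
            stacklevel=2,
        )
        return None  # returning None tells read_csv() to skip the row

    return line_fixer
```

It would be passed as pd.read_csv('test.csv', engine='python', on_bad_lines=make_line_fixer(3)). Skipping the row is a choice made for this sketch; returning the original line instead would bring back the generic ParserWarning shown above.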

Alternative Solutions

  • Pre-process the CSV file separately from the read_csv() function.
  • Pass the line number and expected field count to the callable, which can then emit its own descriptive warning.
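
The first alternative could be sketched with the standard-library csv module. `find_bad_lines` is a hypothetical helper, and it assumes the first row is the header and that no quoted field spans multiple lines. Unlike read_csv(), it also flags rows with too few fields.

```python
import csv
import io


def find_bad_lines(text, delimiter=","):
    """Return (line_number, expected_fields, seen_fields) for each bad row.

    Hypothetical pre-processing helper; assumes the first row is the header.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)

    bad = []
    for row in reader:
        # Ignore blank lines; flag any row whose field count differs.
        if row and len(row) != expected:
            # reader.line_num is the 1-based count of source lines read so far.
            bad.append((reader.line_num, expected, len(row)))
    return bad


sample = "id,field_1,field_2\n101,A,B\n102,C,D,E\n103,F,G\n"
print(find_bad_lines(sample))  # [(3, 3, 4)]
```

This reports the same details as the on_bad_lines='warn' message, but it requires a second pass over the file, outside read_csv().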

Additional Context

No response

Labels: Enhancement, Needs Triage
