Space in front of quotation doesn't recognise a field with multiple entries separated by common delimiter #941

Open
Roberto-Circit opened this issue Jul 14, 2022 · 6 comments

@Roberto-Circit

Using papaparse 5.3.0 version.

A quoted field that contains the delimiter (for example, an address with several comma-separated parts) is not parsed correctly when there is a space in front of the opening quotation mark:
random name, 12345, "address line 1, address line 2, code"
Here results.data contains a row with 5 entries, while

random name, 12345,"address line 1, address line 2, code"
produces a row with 3 entries, as expected.

This issue only occurs with double-quoted fields that contain the delimiter.
Let me know if this isn't enough to go on, or if this has already been fixed in the new 6.0 version.

@pokoli
Collaborator

pokoli commented Jul 14, 2022

I'm not sure I understand. Which delimiter and quote character are you using?
What results do you get for such fields?

@fractalpixel

fractalpixel commented Aug 8, 2022

Can confirm this, ran into the exact same problem.

Example (first row has no problems, second row will be parsed as 4 columns):

"Cobra MK II","some wear, docking computer installed", 80 megacredits
"Milennium Falcon", "surface rust, light beam damage", 150 megacredits

The quotation character used is the default (double quotation mark ") and the separator is comma, but the problem does occur with other delimiters as well.

If the field starts with a space (before the quotation character), the quotation character is not recognized, and any delimiter characters inside the quoted value will be interpreted as delimiters. (The parsing code probably assumes that fields separated by delimiters will not have any preceding (or trailing) whitespace, and assumes that if the quote character doesn't immediately follow the delimiter, the field is not quoted).
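
The assumption described above can be illustrated with a minimal sketch. `naiveSplit` is a hypothetical function written for this illustration; it is not Papa Parse's actual code, and it only handles a single line. It treats a field as quoted only when the quote character is the very first character after a delimiter, which is enough to reproduce the 3-column versus 4-column results from the example rows.

```javascript
// Hypothetical illustration only; this is NOT Papa Parse's implementation.
// A field counts as quoted only if the quote character is its first character.
function naiveSplit(line, delimiter = ",", quoteChar = '"') {
  const fields = [];
  let i = 0;
  while (i <= line.length) {
    let next;
    if (line[i] === quoteChar) {
      // Quoted field: read up to the closing quote; "" is an escaped quote.
      let value = "";
      i++;
      while (i < line.length) {
        if (line[i] === quoteChar) {
          if (line[i + 1] === quoteChar) {
            value += quoteChar;
            i += 2;
            continue;
          }
          break;
        }
        value += line[i++];
      }
      i++; // step past the closing quote
      fields.push(value);
      next = line.indexOf(delimiter, i);
    } else {
      // Unquoted field: everything up to the next delimiter, quotes included.
      next = line.indexOf(delimiter, i);
      fields.push(line.slice(i, next === -1 ? line.length : next));
    }
    i = next === -1 ? line.length + 1 : next + delimiter.length;
  }
  return fields;
}

const row1 = '"Cobra MK II","some wear, docking computer installed", 80 megacredits';
const row2 = '"Milennium Falcon", "surface rust, light beam damage", 150 megacredits';
console.log(naiveSplit(row1).length); // 3
console.log(naiveSplit(row2).length); // 4
```

In the second row the space before the opening quote makes the sketch fall into the unquoted branch, so the comma inside the quotes is treated as a delimiter.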

CSV files with optional whitespace around delimiter characters and/or data values do exist. This is especially common in hand-edited CSV files and in CSV files where columns are aligned with spaces for readability (in addition to using a delimiter character).

Using the transform configuration option to remove starting and trailing whitespace doesn't help, as the transform is run after the quotes are processed.

Setting the field delimiter to ", " (comma followed by a space) breaks in cases where there is no space after the comma (or multiple spaces, if someone tried to manually align columns in addition to using a delimiter character).

Maybe add an option to automatically remove whitespace around unquoted values and outside quoted string values. Using that option would fix this problem, and trimming is in any case something that needs to be done for CSV files where optional whitespace is present, either with a transform function or when processing the data later. For example, the third column in the first row of my example is " 80 megacredits" when parsed, but the user probably wanted a result such as "80 megacredits" (with whitespace trimmed). Some CSV files might rely on storing whitespace around unquoted values, which is why this should probably be an option, but it could be on by default, as that seems to be the most common use case. (Whitespace inside quoted strings should of course always be preserved.)
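
Until such an option exists, one possible workaround is to normalise each line before handing it to the parser. The sketch below is a minimal example under stated assumptions: `trimOutsideQuotes` is a hypothetical helper (not a Papa Parse option), it works on one record per line, and it does not support quoted fields that contain line breaks. It removes whitespace around unquoted values and outside quoted values, and leaves everything inside the quotes untouched.

```javascript
// Hypothetical pre-processing helper; not part of Papa Parse.
// Assumes one record per line (no line breaks inside quoted fields).
function trimOutsideQuotes(line, delimiter = ",", quoteChar = '"') {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === quoteChar) {
      if (inQuotes) {
        if (line[i + 1] === quoteChar) {
          // Escaped quote ("") inside a quoted field: keep both characters.
          current += ch + ch;
          i++;
        } else {
          inQuotes = false;
          current += ch;
        }
      } else {
        // Only open a quoted field if nothing but whitespace precedes the quote.
        if (current.trim() === "") inQuotes = true;
        current += ch;
      }
    } else if (!inQuotes && line.startsWith(delimiter, i)) {
      fields.push(current.trim());
      current = "";
      i += delimiter.length - 1;
    } else {
      current += ch;
    }
  }
  fields.push(current.trim());
  return fields.join(delimiter);
}

console.log(trimOutsideQuotes('"Milennium Falcon", "surface rust, light beam damage", 150 megacredits'));
// "Milennium Falcon","surface rust, light beam damage",150 megacredits
```

The normalised line has the quote character directly after the delimiter, which is the form reported earlier in this thread as parsing correctly.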

@pokoli
Collaborator

pokoli commented Aug 8, 2022

I'm not sure we should implement something to fix hand-edited files.
If someone edits a file and breaks the format, it is normal that it is not parsed correctly.

@Roberto-Circit
Author

Maybe add an option to automatically remove whitespace around unquoted values and outside quoted string values. Using that option would fix this problem, and trimming is in any case something that needs to be done for CSV files where optional whitespace is present, either with a transform function or when processing the data later. For example, the third column in the first row of my example is " 80 megacredits" when parsed, but the user probably wanted a result such as "80 megacredits" (with whitespace trimmed). Some CSV files might rely on storing whitespace around unquoted values, which is why this should probably be an option, but it could be on by default, as that seems to be the most common use case. (Whitespace inside quoted strings should of course always be preserved.)

Agreed, having this option would be nice

@HarryPeach

Also running into this issue, would be good if the library could successfully parse csv files with spaces after commas, like alternatives do

@janisdd
Contributor

janisdd commented Sep 28, 2024

Also running into this issue, would be good if the library could successfully parse csv files with spaces after commas, like alternatives do

This is interesting, because it is not clear what to do in such cases. Papaparse simply makes a decision.

The problem with random name, 12345, "address line 1, address line 2, code" can be simplified to

a, b, "c"

The question is whether the third field should be parsed as just c or as c with a leading space.

  • The quotes clearly indicate that there should be no leading space.
  • The space between the , and the " indicates that there should be a leading space (as in , b).

I think there is no right or wrong, you can go either way.
Papaparse chooses to ignore the quotes and treat them like a normal character rather than a special character.
(This is also the reason why the field in the example contains the leading space)
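
The two readings can be contrasted with a small sketch. `interpretField` is a hypothetical helper written only for this comparison; it is not Papa Parse code. The strict mode follows the behaviour described above (a quote that is not the first character is just a normal character), and the lenient mode assumes that whitespace before the opening quote should be ignored.

```javascript
// Hypothetical helper contrasting the two readings; not part of Papa Parse.
// Assumes a single-character quoteChar.
function interpretField(raw, lenient, quoteChar = '"') {
  // Lenient reading: ignore whitespace outside the quotes before checking them.
  const candidate = lenient ? raw.trim() : raw;
  const isQuoted =
    candidate.length >= 2 &&
    candidate.startsWith(quoteChar) &&
    candidate.endsWith(quoteChar);
  // Not recognised as quoted: return the raw text, quotes and spaces included.
  if (!isQuoted) return raw;
  // Quoted: strip the outer quotes and unescape doubled quotes.
  return candidate.slice(1, -1).split(quoteChar + quoteChar).join(quoteChar);
}

console.log(JSON.stringify(interpretField(' "c"', false))); // " \"c\""
console.log(JSON.stringify(interpretField(' "c"', true)));  // "c"
```

In both modes an unquoted field such as " b" keeps its leading space, so the sketch only changes how the quoted case is resolved.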

5 participants