-
I attempted to use ofNamedCsvRecord(Path, Charset), as this entry point supports BOM detection. Unfortunately, it is dramatically slower than the version that takes a Reader. I ended up with the code below, only using the Path-based version when I needed to detect a BOM. Is there a better approach? Thanks in advance.
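For illustration, here is a minimal sketch of the kind of two-path workaround described; the helper name and structure are assumptions, not the exact code from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import de.siegmar.fastcsv.reader.CsvReader;
import de.siegmar.fastcsv.reader.NamedCsvRecord;

// Hypothetical helper: use the Path-based entry point only when a BOM may
// be present; otherwise hand FastCSV a plain Reader, which is faster.
static CsvReader<NamedCsvRecord> openCsv(Path file, Charset charset, boolean mayHaveBom)
        throws IOException {
    if (mayHaveBom) {
        // Path-based variant: supports BOM detection.
        return CsvReader.builder()
                .detectBomHeader(true)
                .ofNamedCsvRecord(file, charset);
    }
    // Reader-based variant: no BOM detection, but noticeably faster.
    return CsvReader.builder()
            .ofNamedCsvRecord(new BufferedReader(
                    new InputStreamReader(Files.newInputStream(file), charset)));
}
```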
-
That's an interesting and unexpected observation. Unfortunately, I was not able to reproduce it with a JMH benchmark. I have one idea that might explain the cause of the performance difference: with BOM detection enabled, the Path-based variant opens the file twice, once to probe for the BOM header and once to read the actual data. You may want to try the following:

```java
var csv = CsvReader.builder()
        .skipEmptyLines(false)
        .detectBomHeader(true)
        .ofNamedCsvRecord(Files.newInputStream(Path.of(filename)), CHARSET);
```

This also allows you to read the file with BOM detection enabled while only opening the file once. Usually, this is not the recommended approach, as the entire file is read through the InputStream rather than directly from the file. Let me know if this improves the performance in your case.

For reference, here is the JMH benchmark I used to test the different approaches:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import de.siegmar.fastcsv.reader.CsvReader;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class BomPerformanceTest {

    private static final String filename = "/tmp/test.csv";
    private static final Charset CHARSET = StandardCharsets.UTF_8;

    // Path-based entry point with BOM detection enabled.
    @Benchmark
    public long fromFile() throws IOException {
        try (var csv = CsvReader.builder()
                .skipEmptyLines(false)
                .detectBomHeader(true)
                .ofNamedCsvRecord(Path.of(filename), CHARSET)) {
            return csv.stream().count();
        }
    }

    // InputStream-based entry point with BOM detection enabled (file opened once).
    @Benchmark
    public long fromInputStream() throws IOException {
        try (var csv = CsvReader.builder()
                .skipEmptyLines(false)
                .detectBomHeader(true)
                .ofNamedCsvRecord(Files.newInputStream(Path.of(filename)), CHARSET)) {
            return csv.stream().count();
        }
    }

    // Reader-based entry point; no BOM detection possible here.
    @Benchmark
    public long fromReader() throws IOException {
        try (var csv = CsvReader.builder()
                .skipEmptyLines(false)
                .ofNamedCsvRecord(new BufferedReader(
                        new InputStreamReader(new FileInputStream(filename), CHARSET)))) {
            return csv.stream().count();
        }
    }
}
```

I ran the benchmark on my machine against a sample CSV file containing five million records with a size of 595 MiB.
-
Many thanks for your quick response. You are correct: I was processing my test suite, which has a relatively large number (~14,000) of relatively small files (~1,000 lines each). I will play around with your suggestion.
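For illustration, applying the suggestion to a workload like that might look as follows; the directory layout, helper name, and charset are assumptions:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import de.siegmar.fastcsv.reader.CsvReader;

// Hypothetical helper: process a directory of many small CSV files using the
// single-open InputStream variant suggested above, so each file is opened once.
static long countAllRecords(Path dir, Charset charset) throws IOException {
    long total = 0;
    try (var files = Files.list(dir)) {
        for (Path file : (Iterable<Path>) files::iterator) {
            try (var csv = CsvReader.builder()
                    .skipEmptyLines(false)
                    .detectBomHeader(true)
                    .ofNamedCsvRecord(Files.newInputStream(file), charset)) {
                total += csv.stream().count();
            }
        }
    }
    return total;
}
```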
-
Thanks for the confirmation! I created #149 to fix this in FastCSV 4, which will be released by tomorrow. It would be nice if you could mark my answer above as the answer ("Mark as answer").
-
Your suggestion worked perfectly. I will attempt an upgrade to 4.0 when it is released. Many thanks for the library; I am just converting over from univocity.