-
I attempted to use ofNamedCsvRecord(Path, Charset), as this entry point supports BOM detection. Unfortunately, it is dramatically slower than the version that takes a Reader. I ended up with the code below, only using the Path-based version when I needed to detect a BOM. Is there a better approach? Thanks in advance.
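For illustration, here is a minimal sketch of the kind of two-path workaround described; the helper name and structure are assumptions, not the exact code from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import de.siegmar.fastcsv.reader.CsvReader;
import de.siegmar.fastcsv.reader.NamedCsvRecord;

// Hypothetical helper: use the Path-based entry point only when a BOM may
// be present; otherwise hand FastCSV a plain Reader, which is faster.
static CsvReader<NamedCsvRecord> openCsv(Path file, Charset charset, boolean mayHaveBom)
        throws IOException {
    if (mayHaveBom) {
        // Path-based variant: supports BOM detection.
        return CsvReader.builder()
                .detectBomHeader(true)
                .ofNamedCsvRecord(file, charset);
    }
    // Reader-based variant: no BOM detection, but noticeably faster.
    return CsvReader.builder()
            .ofNamedCsvRecord(new BufferedReader(
                    new InputStreamReader(Files.newInputStream(file), charset)));
}
```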
-
That's an interesting and unexpected observation. Unfortunately, I was not able to reproduce it with a JMH benchmark. I have one idea that might explain the cause of the performance difference: with BOM detection enabled, the Path-based variant opens the file twice, once to probe for the BOM header and once to read the actual data. You may want to try the following:

```java
var csv = CsvReader.builder()
        .skipEmptyLines(false)
        .detectBomHeader(true)
        .ofNamedCsvRecord(Files.newInputStream(Path.of(filename)), CHARSET);
```

This also allows you to read the file with BOM detection enabled while only opening the file once. Usually, this is not the recommended approach, as the entire file is read through the InputStream rather than directly from the file. Let me know if this improves the performance in your case.

For reference, here is the JMH benchmark I used to test the different approaches:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import de.siegmar.fastcsv.reader.CsvReader;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class BomPerformanceTest {

    private static final String filename = "/tmp/test.csv";
    private static final Charset CHARSET = StandardCharsets.UTF_8;

    // Path-based entry point with BOM detection enabled.
    @Benchmark
    public long fromFile() throws IOException {
        try (var csv = CsvReader.builder()
                .skipEmptyLines(false)
                .detectBomHeader(true)
                .ofNamedCsvRecord(Path.of(filename), CHARSET)) {
            return csv.stream().count();
        }
    }

    // InputStream-based entry point with BOM detection enabled (file opened once).
    @Benchmark
    public long fromInputStream() throws IOException {
        try (var csv = CsvReader.builder()
                .skipEmptyLines(false)
                .detectBomHeader(true)
                .ofNamedCsvRecord(Files.newInputStream(Path.of(filename)), CHARSET)) {
            return csv.stream().count();
        }
    }

    // Reader-based entry point; no BOM detection possible here.
    @Benchmark
    public long fromReader() throws IOException {
        try (var csv = CsvReader.builder()
                .skipEmptyLines(false)
                .ofNamedCsvRecord(new BufferedReader(
                        new InputStreamReader(new FileInputStream(filename), CHARSET)))) {
            return csv.stream().count();
        }
    }
}
```

I ran the benchmark on my machine against a sample CSV file containing five million records with a size of 595 MiB.
-
Many thanks for your quick response. You are correct: I was processing my test suite, which has a relatively large number (~14,000) of relatively small files (~1,000 lines each). I will play around with your suggestion.
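For illustration, applying the suggestion to a workload like that might look as follows; the directory layout, helper name, and charset are assumptions:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import de.siegmar.fastcsv.reader.CsvReader;

// Hypothetical helper: process a directory of many small CSV files using the
// single-open InputStream variant suggested above, so each file is opened once.
static long countAllRecords(Path dir, Charset charset) throws IOException {
    long total = 0;
    try (var files = Files.list(dir)) {
        for (Path file : (Iterable<Path>) files::iterator) {
            try (var csv = CsvReader.builder()
                    .skipEmptyLines(false)
                    .detectBomHeader(true)
                    .ofNamedCsvRecord(Files.newInputStream(file), charset)) {
                total += csv.stream().count();
            }
        }
    }
    return total;
}
```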
-
Thanks for the confirmation! I created #149 to fix this in FastCSV 4, which will be released by tomorrow. It would be nice if you could mark my answer above as the answer ("Mark as answer").
-
Your suggestion worked perfectly. I will attempt an upgrade to 4.0 when it is released. Many thanks for the library; I am just converting over from univocity.