Dutch B1 Word List Generator

A Go program that collects B1-level Dutch vocabulary from NT2Lex, a CEFR-graded lexical resource for Dutch as a foreign language.

Overview

This program offers several ways to build lists of Dutch B1-level vocabulary:

NT2Lex Integration: Parse actual NT2Lex data files for scientifically validated B1 words
Online Tools: Query the NT2Lex online tools (when available)
Frequency Analysis: Filter words by usage frequency and complexity metrics
Multiple Export Formats: Generate word lists in JSON, text, and frequency-sorted formats

Features

Core Functionality

CEFR-Graded: Words are classified as B1 according to the NT2Lex grading (Tack et al., 2018)
Linguistically Rich: Each word includes part-of-speech, frequency data, and complexity metrics
Multiple Input Formats: Support for CSV, TSV, and simple word list files
Frequency Filtering: Remove low-frequency words to focus on the most useful vocabulary
Categorization: Group words by part of speech (nouns, verbs, adjectives, etc.)

Data Quality

Based on actual Dutch learning materials (textbooks, simplified readers)
Validated by academic research (Tack et al., 2018)
Includes 15,000+ lexical entries with detailed linguistic information

Installation

Clone this repository:

git clone <your-repo-url>
cd nl-b1-wordlist

Build the program:
```
go build
```

Usage

Option 1: Using NT2Lex Data Files

Get NT2Lex data:

git clone https://github.com/anaistack/NT2Lex.git

Parse the data files:

// In your main.go, use the parser (this snippet needs the "log" import)
collector := NewB1WordCollector()
parser := NewNT2LexParser(collector)

// Parse CSV/TSV files from the NT2Lex repository
if err := parser.ParseCSVFile("NT2Lex/resource/nt2lex_data.csv"); err != nil {
    log.Fatalf("parsing NT2Lex data: %v", err)
}
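
To illustrate what the parsing step might involve, here is a minimal sketch using the standard `encoding/csv` package. It is not the parser shipped in this repository. The `Entry` type, the `parseB1` and `parseB1String` functions, and the column names `word`, `tag`, and `level` are all assumptions made for the example; the real NT2Lex files may be laid out differently, and the sample rows and their levels are invented.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// Entry is a hypothetical, simplified stand-in for NT2LexEntry.
type Entry struct {
	Word, Tag, Level string
}

// parseB1 reads tab-separated rows and keeps those whose level column is "B1".
// The column names (word, tag, level) are assumptions for this sketch.
func parseB1(r io.Reader) ([]Entry, error) {
	cr := csv.NewReader(r)
	cr.Comma = '\t'
	cr.LazyQuotes = true // tolerate stray quotes in lexical data

	header, err := cr.Read()
	if err != nil {
		return nil, fmt.Errorf("reading header: %w", err)
	}
	// Map column names to indexes so the column order does not matter.
	col := make(map[string]int, len(header))
	for i, name := range header {
		col[strings.ToLower(strings.TrimSpace(name))] = i
	}
	for _, name := range []string{"word", "tag", "level"} {
		if _, ok := col[name]; !ok {
			return nil, fmt.Errorf("missing column %q", name)
		}
	}

	var out []Entry
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, fmt.Errorf("reading row: %w", err)
		}
		if strings.TrimSpace(rec[col["level"]]) != "B1" {
			continue // skip every other CEFR level
		}
		out = append(out, Entry{Word: rec[col["word"]], Tag: rec[col["tag"]], Level: "B1"})
	}
	return out, nil
}

// parseB1String is a convenience wrapper for small in-memory inputs.
func parseB1String(s string) ([]Entry, error) {
	return parseB1(strings.NewReader(s))
}

func main() {
	// Invented sample rows; the levels shown are not taken from NT2Lex.
	data := "word\ttag\tlevel\nhuis\tnoun\tA1\nmening\tnoun\tB1\noverwegen\tverb\tB1\n"
	entries, err := parseB1String(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Println(e.Word, e.Tag)
	}
}
```

Looking columns up by header name rather than by position keeps the sketch tolerant of reordered columns, which may help if different NT2Lex exports vary.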

Option 2: Using Online NT2Lex Tools

Visit cental.uclouvain.be/nt2lex/ to:

Search individual words
Analyze text complexity
Export B1 word lists

Option 3: Run the Demo

go run .

This will:

Generate a sample B1 word list
Export to dutch_b1_words.json and dutch_b1_words.txt
Display statistics about the collected words

Program Structure

Main Components

// Core data structure for lexical entries
type NT2LexEntry struct {
    Word       string            // Canonical form (lemma)
    Tag        string            // Part of speech
    CEFRLevel  string            // CEFR level (B1)
    Statistics map[string]string // Frequency and complexity data
}

// Main collector for B1 words
type B1WordCollector struct {
    words      map[string]*NT2LexEntry
    // Methods for adding, filtering, and exporting words
}

// Parser for NT2Lex data files
type NT2LexParser struct {
    // Methods for parsing CSV, TSV, and word list files
}

Key Methods

AddWord(word string): Add a word if it's B1 level
ParseCSVFile(filename string): Parse NT2Lex CSV data
FilterByFrequency(minFreq int): Remove low-frequency words
ExportToJSON(filename string): Export as JSON
ExportToTextFile(filename string): Export as categorized text
GenerateWordFrequencyReport(filename string): Create frequency-sorted report
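
As an illustration of how frequency filtering could work over the collector's word map, here is a minimal sketch. It assumes the frequency is stored under the `"f"` key of `Statistics`, as in the JSON sample in this README, and it treats a missing or non-numeric value as a reason to drop the entry. Both choices, and the free function `filterByFrequency`, are assumptions for the example rather than the repository's actual method.

```go
package main

import (
	"fmt"
	"strconv"
)

// NT2LexEntry mirrors the struct shown under Main Components.
type NT2LexEntry struct {
	Word       string
	Tag        string
	CEFRLevel  string
	Statistics map[string]string
}

// filterByFrequency deletes entries whose "f" statistic is missing,
// non-numeric, or below minFreq, and returns how many were removed.
func filterByFrequency(words map[string]*NT2LexEntry, minFreq int) int {
	removed := 0
	for key, e := range words {
		freq, err := strconv.Atoi(e.Statistics["f"])
		if err != nil || freq < minFreq {
			delete(words, key) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	// Invented sample data for the demonstration.
	words := map[string]*NT2LexEntry{
		"huis":     {Word: "huis", Tag: "noun", CEFRLevel: "B1", Statistics: map[string]string{"f": "1250"}},
		"zeldzaam": {Word: "zeldzaam", Tag: "adjective", CEFRLevel: "B1", Statistics: map[string]string{"f": "12"}},
	}
	removed := filterByFrequency(words, 100)
	fmt.Printf("removed %d, kept %d\n", removed, len(words))
}
```

Keeping entries that lack frequency data would be an equally reasonable policy; the sketch drops them only to keep the rule simple.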

Output Formats

JSON Export (`dutch_b1_words.json`)

[
  {
    "word": "huis",
    "tag": "noun",
    "cefr_level": "B1",
    "statistics": {
      "f": "1250",
      "u": "850"
    }
  }
]
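
The JSON shape above can be produced with struct tags and the standard `encoding/json` package. The following is a sketch under that assumption: the field names come from the sample, but the tags themselves and the `marshalEntries` helper are hypothetical and may not match the repository's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// NT2LexEntry with JSON tags matching the keys in the sample output.
type NT2LexEntry struct {
	Word       string            `json:"word"`
	Tag        string            `json:"tag"`
	CEFRLevel  string            `json:"cefr_level"`
	Statistics map[string]string `json:"statistics"`
}

// marshalEntries renders the entries as an indented JSON array.
// Entries are sorted by word so repeated exports are byte-identical.
func marshalEntries(words map[string]*NT2LexEntry) ([]byte, error) {
	list := make([]*NT2LexEntry, 0, len(words))
	for _, e := range words {
		list = append(list, e)
	}
	sort.Slice(list, func(i, j int) bool { return list[i].Word < list[j].Word })
	return json.MarshalIndent(list, "", "  ")
}

func main() {
	words := map[string]*NT2LexEntry{
		"huis": {
			Word:       "huis",
			Tag:        "noun",
			CEFRLevel:  "B1",
			Statistics: map[string]string{"f": "1250", "u": "850"},
		},
	}
	out, err := marshalEntries(words)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Writing the result to disk is then a single call to `os.WriteFile`. Sorting before marshalling matters because Go randomizes map iteration order.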

Text Export (`dutch_b1_words.txt`)

# Dutch B1 Vocabulary List
# Total words: 2847

## noun (1205 words)
auto
familie
geld
huis
school
tijd
werk

## verb (892 words)
beginnen
denken
eten
gaan
helpen
maken

## adjective (750 words)
goed
groot
klein
mooi
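
A grouped listing like the one above could be built roughly as follows. This sketch assumes categories are ordered by size (largest first), which is what the sample suggests, and that words are sorted alphabetically within each category. The `formatCategorized` function and the trimmed-down entry type are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// NT2LexEntry is reduced to the two fields this sketch needs.
type NT2LexEntry struct {
	Word string
	Tag  string
}

// formatCategorized groups words by part of speech and renders them in the
// text export layout: a header, then one "## tag (n words)" block per group.
func formatCategorized(entries []NT2LexEntry) string {
	groups := make(map[string][]string)
	for _, e := range entries {
		groups[e.Tag] = append(groups[e.Tag], e.Word)
	}

	// Order categories by size, largest first; ties fall back to the tag name.
	tags := make([]string, 0, len(groups))
	for tag := range groups {
		tags = append(tags, tag)
	}
	sort.Slice(tags, func(i, j int) bool {
		a, b := len(groups[tags[i]]), len(groups[tags[j]])
		if a != b {
			return a > b
		}
		return tags[i] < tags[j]
	})

	var b strings.Builder
	fmt.Fprintf(&b, "# Dutch B1 Vocabulary List\n# Total words: %d\n", len(entries))
	for _, tag := range tags {
		words := groups[tag]
		sort.Strings(words)
		fmt.Fprintf(&b, "\n## %s (%d words)\n", tag, len(words))
		for _, w := range words {
			b.WriteString(w)
			b.WriteByte('\n')
		}
	}
	return b.String()
}

func main() {
	entries := []NT2LexEntry{
		{Word: "huis", Tag: "noun"},
		{Word: "gaan", Tag: "verb"},
		{Word: "auto", Tag: "noun"},
	}
	fmt.Print(formatCategorized(entries))
}
```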

Frequency Report (`frequency_report.txt`)

Word    Frequency    Part of Speech
----    ---------    --------------
het     15420        determiner
een     12380        determiner
zijn    8940         verb
hebben  7530         verb
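
One way to produce an aligned, frequency-sorted report like this is with the standard `text/tabwriter` package. The sketch below assumes the frequency lives under the `"f"` key of `Statistics` and counts unparsable values as 0. The helper names `frequencyOf`, `sortByFrequency`, and `formatReport` are made up for the example, and the exact column spacing may differ from the repository's output.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
	"text/tabwriter"
)

// NT2LexEntry is reduced to the fields this sketch needs.
type NT2LexEntry struct {
	Word       string
	Tag        string
	Statistics map[string]string
}

// frequencyOf parses the "f" statistic; missing or invalid values count as 0.
func frequencyOf(e NT2LexEntry) int {
	n, err := strconv.Atoi(e.Statistics["f"])
	if err != nil {
		return 0
	}
	return n
}

// sortByFrequency returns a copy sorted by descending frequency,
// breaking ties alphabetically by word.
func sortByFrequency(entries []NT2LexEntry) []NT2LexEntry {
	sorted := append([]NT2LexEntry(nil), entries...)
	sort.Slice(sorted, func(i, j int) bool {
		fi, fj := frequencyOf(sorted[i]), frequencyOf(sorted[j])
		if fi != fj {
			return fi > fj
		}
		return sorted[i].Word < sorted[j].Word
	})
	return sorted
}

// formatReport renders the sorted entries as space-aligned columns.
func formatReport(entries []NT2LexEntry) string {
	var b strings.Builder
	tw := tabwriter.NewWriter(&b, 0, 0, 4, ' ', 0)
	fmt.Fprintln(tw, "Word\tFrequency\tPart of Speech")
	fmt.Fprintln(tw, "----\t---------\t--------------")
	for _, e := range sortByFrequency(entries) {
		fmt.Fprintf(tw, "%s\t%d\t%s\n", e.Word, frequencyOf(e), e.Tag)
	}
	tw.Flush() // writes to a strings.Builder do not fail, so the error is ignored
	return b.String()
}

func main() {
	entries := []NT2LexEntry{
		{Word: "zijn", Tag: "verb", Statistics: map[string]string{"f": "8940"}},
		{Word: "het", Tag: "determiner", Statistics: map[string]string{"f": "15420"}},
	}
	fmt.Print(formatReport(entries))
}
```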

Data Sources

Primary: NT2Lex Database

Repository: github.com/anaistack/NT2Lex
Online Tools: cental.uclouvain.be/nt2lex/
Paper: Tack et al. (2018) - NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 137-146.

Alternative Sources

Dutch coursebook vocabulary lists
Frequency dictionaries
Common topic-based word lists (housing, food, transport, etc.)

Research Background

NT2Lex is based on academic research that analyzed:

Dutch textbook reading activities
Simplified readers for Dutch learners
Frequency distributions across CEFR levels
Linguistic complexity metrics

B1-level words represent the vocabulary that intermediate learners of Dutch need for:

Expressing opinions and experiences
Describing plans and ambitions
Handling most everyday situations
Understanding main points of clear texts

Example: Processing NT2Lex Data

package main

import "fmt"

func main() {
    // Create collector
    collector := NewB1WordCollector()
    parser := NewNT2LexParser(collector)

    // Parse NT2Lex data
    if err := parser.ParseCSVFile("nt2lex_data.csv"); err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    // Filter by frequency (optional)
    parser.FilterByFrequency(100) // Keep words with freq >= 100

    // Generate reports
    collector.ExportToJSON("b1_words.json")
    collector.ExportToTextFile("b1_words.txt")
    parser.GenerateWordFrequencyReport("frequency_report.txt")

    // Display statistics
    collector.PrintStats()
}

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

The NT2Lex data is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0).

References

Tack, A., François, T., Desmet, P., & Fairon, C. (2018). NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 137-146.
