dstotijn/nl-b1-wordlist

Dutch B1 Word List Generator

A Go program that builds B1-level Dutch vocabulary lists from the NT2Lex database, a CEFR-graded lexical resource for Dutch as a Foreign Language.

Overview

This program provides multiple ways to generate comprehensive lists of Dutch B1-level vocabulary:

  • NT2Lex Integration: Parse actual NT2Lex data files for scientifically validated B1 words
  • API Interface: Query NT2Lex online tools (when available)
  • Frequency Analysis: Filter words by usage frequency and complexity metrics
  • Multiple Export Formats: Generate word lists in JSON, text, and frequency-sorted formats

Features

Core Functionality

  • CEFR-Graded: Words are specifically classified as B1 level according to academic research
  • Linguistically Rich: Each word includes part-of-speech, frequency data, and complexity metrics
  • Multiple Input Formats: Support for CSV, TSV, and simple word list files
  • Frequency Filtering: Remove low-frequency words to focus on most useful vocabulary
  • Categorization: Group words by part of speech (nouns, verbs, adjectives, etc.)

Data Quality

  • Based on actual Dutch learning materials (textbooks, simplified readers)
  • Validated by academic research (Tack et al., 2018)
  • Includes 15,000+ lexical entries with detailed linguistic information

Installation

  1. Clone this repository:

    git clone <your-repo-url>
    cd nl-b1-wordlist
  2. Build the program:

    go build

Usage

Option 1: Using NT2Lex Data Files

  1. Get NT2Lex data:

    git clone https://github.com/anaistack/NT2Lex.git
  2. Parse the data files:

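    // In your main.go, wire the parser to a collector.
    collector := NewB1WordCollector()
    parser := NewNT2LexParser(collector)

    // Parse CSV/TSV files from the NT2Lex repository and stop on failure.
    if err := parser.ParseCSVFile("NT2Lex/resource/nt2lex_data.csv"); err != nil {
        log.Fatalf("parsing NT2Lex data: %v", err) // requires: import "log"
    }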

Option 2: Using Online NT2Lex Tools

Visit cental.uclouvain.be/nt2lex/ to:

  • Search individual words
  • Analyze text complexity
  • Export B1 word lists

Option 3: Run the Demo

go run .

This will:

  • Generate a sample B1 word list
  • Export to dutch_b1_words.json and dutch_b1_words.txt
  • Display statistics about the collected words

Program Structure

Main Components

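// Core data structure for lexical entries.
// The JSON tags match the keys shown under "JSON Export" below.
type NT2LexEntry struct {
    Word       string            `json:"word"`       // Canonical form (lemma)
    Tag        string            `json:"tag"`        // Part of speech
    CEFRLevel  string            `json:"cefr_level"` // CEFR level (B1)
    Statistics map[string]string `json:"statistics"` // Frequency and complexity data
}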

// Main collector for B1 words
type B1WordCollector struct {
    words      map[string]*NT2LexEntry
    // Methods for adding, filtering, and exporting words
}

// Parser for NT2Lex data files
type NT2LexParser struct {
    // Methods for parsing CSV, TSV, and word list files
}
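
As an illustration of how the collector could work, the sketch below pairs the types above with a constructor and a level check. The `addEntry` method is a hypothetical helper invented for this example (the program's own `AddWord` takes only a word string), and keying the map by lemma is an assumption.

```go
package main

import "fmt"

type NT2LexEntry struct {
	Word       string
	Tag        string
	CEFRLevel  string
	Statistics map[string]string
}

type B1WordCollector struct {
	words map[string]*NT2LexEntry
}

// NewB1WordCollector returns a collector with an initialized map.
func NewB1WordCollector() *B1WordCollector {
	return &B1WordCollector{words: make(map[string]*NT2LexEntry)}
}

// addEntry (hypothetical) stores e only if it is graded B1 and has a
// non-empty lemma. It reports whether the entry was kept.
func (c *B1WordCollector) addEntry(e NT2LexEntry) bool {
	if e.CEFRLevel != "B1" || e.Word == "" {
		return false
	}
	c.words[e.Word] = &e // assumption: one entry per lemma
	return true
}

func main() {
	c := NewB1WordCollector()
	fmt.Println(c.addEntry(NT2LexEntry{Word: "huis", Tag: "noun", CEFRLevel: "B1"})) // true
	fmt.Println(c.addEntry(NT2LexEntry{Word: "kat", Tag: "noun", CEFRLevel: "A1"}))  // false
	fmt.Println(len(c.words))                                                        // 1
}
```

Keying by lemma alone means a later entry with the same spelling but a different part of speech would overwrite the earlier one; a composite key such as word plus tag would avoid that.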

Key Methods

  • AddWord(word string): Add a word if it's B1 level
  • ParseCSVFile(filename string): Parse NT2Lex CSV data
  • FilterByFrequency(minFreq int): Remove low-frequency words
  • ExportToJSON(filename string): Export as JSON
  • ExportToTextFile(filename string): Export as categorized text
  • GenerateWordFrequencyReport(filename string): Create frequency-sorted report
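
A frequency filter along the lines of `FilterByFrequency` could be sketched as follows. The free function `filterByFrequency`, the use of the `"f"` statistic as the frequency field, and the choice to drop entries whose frequency is missing or non-numeric are all assumptions for illustration, not the program's confirmed behavior.

```go
package main

import (
	"fmt"
	"strconv"
)

type NT2LexEntry struct {
	Word       string
	Tag        string
	CEFRLevel  string
	Statistics map[string]string
}

// filterByFrequency removes entries whose "f" statistic is missing,
// not an integer, or below minFreq. It modifies the map in place.
func filterByFrequency(words map[string]*NT2LexEntry, minFreq int) {
	for key, e := range words {
		f, err := strconv.Atoi(e.Statistics["f"])
		if err != nil || f < minFreq {
			delete(words, key) // deleting during range is safe in Go
		}
	}
}

func main() {
	words := map[string]*NT2LexEntry{
		"huis":     {Word: "huis", Tag: "noun", CEFRLevel: "B1", Statistics: map[string]string{"f": "1250"}},
		"zeldzaam": {Word: "zeldzaam", Tag: "adjective", CEFRLevel: "B1", Statistics: map[string]string{"f": "12"}},
	}
	filterByFrequency(words, 100)
	fmt.Println(len(words)) // 1
}
```

Because `Statistics` stores values as strings, the conversion has to happen at filter time. If the source data uses fractional frequencies, `strconv.ParseFloat` would be the closer fit.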

Output Formats

JSON Export (dutch_b1_words.json)

[
  {
    "word": "huis",
    "tag": "noun",
    "cefr_level": "B1",
    "statistics": {
      "f": "1250",
      "u": "850"
    }
  }
]

Text Export (dutch_b1_words.txt)

# Dutch B1 Vocabulary List
# Total words: 2847

## noun (1205 words)
auto
familie
geld
huis
school
tijd
werk

## verb (892 words)
beginnen
denken
eten
gaan
helpen
maken

## adjective (750 words)
goed
groot
klein
mooi
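
The grouping behind this layout can be sketched as below. The `entry` type and `formatWordList` function are hypothetical, and ordering the sections by descending group size is an assumption inferred from the sample above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// entry is a hypothetical, reduced form of NT2LexEntry.
type entry struct{ Word, Tag string }

// formatWordList groups words by part of speech, largest group first,
// and lists the words of each group alphabetically.
func formatWordList(entries []entry) string {
	groups := make(map[string][]string)
	for _, e := range entries {
		groups[e.Tag] = append(groups[e.Tag], e.Word)
	}
	tags := make([]string, 0, len(groups))
	for tag := range groups {
		tags = append(tags, tag)
	}
	// Larger groups come first; ties are broken alphabetically.
	sort.Slice(tags, func(i, j int) bool {
		if len(groups[tags[i]]) != len(groups[tags[j]]) {
			return len(groups[tags[i]]) > len(groups[tags[j]])
		}
		return tags[i] < tags[j]
	})
	var b strings.Builder
	fmt.Fprintf(&b, "# Dutch B1 Vocabulary List\n# Total words: %d\n", len(entries))
	for _, tag := range tags {
		words := groups[tag]
		sort.Strings(words)
		fmt.Fprintf(&b, "\n## %s (%d words)\n", tag, len(words))
		for _, w := range words {
			b.WriteString(w + "\n")
		}
	}
	return b.String()
}

func main() {
	fmt.Print(formatWordList([]entry{
		{"huis", "noun"}, {"auto", "noun"}, {"gaan", "verb"},
	}))
}
```

Sorting both the section order and the words inside each section keeps the file identical between runs, which makes diffs of regenerated lists meaningful.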

Frequency Report (frequency_report.txt)

Word    Frequency    Part of Speech
----    ---------    --------------
het     15420        determiner
een     12380        determiner
zijn    8940         verb
hebben  7530         verb
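
A report like this could be built by sorting on frequency and aligning columns with `text/tabwriter`. The `wordFreq` type and `sortByFrequency` helper are hypothetical names for this sketch, and the padding settings are arbitrary, so the column spacing may differ from the sample above.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"text/tabwriter"
)

// wordFreq is a hypothetical row type with the frequency already parsed.
type wordFreq struct {
	Word string
	Freq int
	Tag  string
}

// sortByFrequency returns a copy of entries ordered from most to least
// frequent, leaving the input slice untouched.
func sortByFrequency(entries []wordFreq) []wordFreq {
	sorted := append([]wordFreq(nil), entries...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Freq > sorted[j].Freq
	})
	return sorted
}

func main() {
	entries := []wordFreq{
		{"zijn", 8940, "verb"},
		{"het", 15420, "determiner"},
		{"een", 12380, "determiner"},
	}
	// tabwriter pads tab-separated cells so the columns line up.
	tw := tabwriter.NewWriter(os.Stdout, 0, 8, 4, ' ', 0)
	fmt.Fprintln(tw, "Word\tFrequency\tPart of Speech")
	fmt.Fprintln(tw, "----\t---------\t--------------")
	for _, e := range sortByFrequency(entries) {
		fmt.Fprintf(tw, "%s\t%d\t%s\n", e.Word, e.Freq, e.Tag)
	}
	if err := tw.Flush(); err != nil {
		fmt.Fprintln(os.Stderr, "writing report:", err)
	}
}
```

Swapping `os.Stdout` for an `*os.File` would write the same table to `frequency_report.txt`.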

Data Sources

Primary: NT2Lex Database

  • Repository: github.com/anaistack/NT2Lex
  • Online Tools: cental.uclouvain.be/nt2lex/
  • Paper: Tack et al. (2018) - NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 137-146.

Alternative Sources

  • Dutch coursebook vocabulary lists
  • Frequency dictionaries
  • Common topic-based word lists (housing, food, transport, etc.)

Research Background

NT2Lex is based on academic research that analyzed:

  • Dutch textbook reading activities
  • Simplified readers for Dutch learners
  • Frequency distributions across CEFR levels
  • Linguistic complexity metrics

The B1 level words represent vocabulary that intermediate Dutch learners should know for:

  • Expressing opinions and experiences
  • Describing plans and ambitions
  • Handling most everyday situations
  • Understanding main points of clear texts

Example: Processing NT2Lex Data

package main

import "fmt"

func main() {
    // Create collector
    collector := NewB1WordCollector()
    parser := NewNT2LexParser(collector)

    // Parse NT2Lex data
    if err := parser.ParseCSVFile("nt2lex_data.csv"); err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    // Filter by frequency (optional)
    parser.FilterByFrequency(100) // Keep words with freq >= 100

    // Generate reports
    collector.ExportToJSON("b1_words.json")
    collector.ExportToTextFile("b1_words.txt")
    parser.GenerateWordFrequencyReport("frequency_report.txt")

    // Display statistics
    collector.PrintStats()
}

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

The NT2Lex data is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

References

Tack, A., François, T., Desmet, P., & Fairon, C. (2018). NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 137-146.
