A Go program that collects B1-level Dutch vocabulary from the NT2Lex database, a CEFR-graded lexical resource for Dutch as a foreign language.
This program provides multiple ways to generate comprehensive lists of Dutch B1-level vocabulary:
- NT2Lex Integration: Parse actual NT2Lex data files for scientifically validated B1 words
- API Interface: Query NT2Lex online tools (when available)
- Frequency Analysis: Filter words by usage frequency and complexity metrics
- Multiple Export Formats: Generate word lists in JSON, text, and frequency-sorted formats
- CEFR-Graded: Words are specifically classified as B1 level according to academic research
- Linguistically Rich: Each word includes part-of-speech, frequency data, and complexity metrics
- Multiple Input Formats: Support for CSV, TSV, and simple word list files
- Frequency Filtering: Remove low-frequency words to focus on most useful vocabulary
- Categorization: Group words by part of speech (nouns, verbs, adjectives, etc.)
- Based on actual Dutch learning materials (textbooks, simplified readers)
- Validated by academic research (Tack et al., 2018)
- Includes 15,000+ lexical entries with detailed linguistic information
- Clone this repository:
	git clone <your-repo-url>
	cd nl-b1-wordlist
- Build the program:
	go build
- Get NT2Lex data:
	git clone https://github.com/anaistack/NT2Lex.git
- Parse the data files:
	// In your main.go, use the parser
	collector := NewB1WordCollector()
	parser := NewNT2LexParser(collector)
	// Parse CSV/TSV files from NT2Lex repository
	err := parser.ParseCSVFile("NT2Lex/resource/nt2lex_data.csv")
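As a rough illustration of what the parsing step might involve, the sketch below reads delimited rows with the standard `encoding/csv` package and keeps only those marked B1. The column names `word`, `tag`, and `level`, the `Entry` type, and the helpers `parseB1` and `parseB1String` are assumptions made for this example; they are not the documented NT2Lex layout or this program's actual parser.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// Entry holds the three fields this sketch extracts from each row.
type Entry struct {
	Word      string
	Tag       string
	CEFRLevel string
}

// parseB1 reads delimited rows from r and returns the rows whose level is B1.
// ASSUMPTION: the header contains columns named "word", "tag" and "level".
func parseB1(r io.Reader, delim rune) ([]Entry, error) {
	cr := csv.NewReader(r)
	cr.Comma = delim // ',' for CSV, '\t' for TSV

	header, err := cr.Read()
	if err != nil {
		return nil, fmt.Errorf("reading header: %w", err)
	}
	col := make(map[string]int, len(header))
	for i, name := range header {
		col[strings.ToLower(strings.TrimSpace(name))] = i
	}
	for _, need := range []string{"word", "tag", "level"} {
		if _, ok := col[need]; !ok {
			return nil, fmt.Errorf("missing column %q", need)
		}
	}

	var out []Entry
	for {
		rec, err := cr.Read()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, fmt.Errorf("reading row: %w", err)
		}
		if strings.EqualFold(strings.TrimSpace(rec[col["level"]]), "B1") {
			out = append(out, Entry{
				Word:      strings.TrimSpace(rec[col["word"]]),
				Tag:       strings.TrimSpace(rec[col["tag"]]),
				CEFRLevel: "B1",
			})
		}
	}
}

// parseB1String is a convenience wrapper for in-memory data.
func parseB1String(data string, delim rune) ([]Entry, error) {
	return parseB1(strings.NewReader(data), delim)
}

func main() {
	// Made-up sample rows in the assumed layout.
	sample := "word\ttag\tlevel\nhuis\tnoun\tB1\nkat\tnoun\tA1\nbeginnen\tverb\tb1\n"
	entries, err := parseB1String(sample, '\t')
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, e := range entries {
		fmt.Printf("%s (%s)\n", e.Word, e.Tag)
	}
}
```

Looking columns up by header name rather than by position keeps the sketch tolerant of column reordering; the real data files may need a different mapping.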
Visit cental.uclouvain.be/nt2lex/ to:
- Search individual words
- Analyze text complexity
- Export B1 word lists
go run .
This will:
- Generate a sample B1 word list
- Export to dutch_b1_words.json and dutch_b1_words.txt
- Display statistics about the collected words
// Core data structure for lexical entries
type NT2LexEntry struct {
	Word       string            // Canonical form (lemma)
	Tag        string            // Part of speech
	CEFRLevel  string            // CEFR level (B1)
	Statistics map[string]string // Frequency and complexity data
}

// Main collector for B1 words
type B1WordCollector struct {
	words map[string]*NT2LexEntry
	// Methods for adding, filtering, and exporting words
}

// Parser for NT2Lex data files
type NT2LexParser struct {
	// Methods for parsing CSV, TSV, and word list files
}
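To make the role of the `words` map concrete, here is a minimal sketch of how a collector keyed by lemma could accept only B1 entries and skip duplicates. The names `lexEntry`, `wordCollector`, `newWordCollector`, `add`, and `size` are hypothetical stand-ins chosen for this example; the program's own constructor and methods may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// lexEntry mirrors the fields of the entry struct described above.
type lexEntry struct {
	Word       string
	Tag        string
	CEFRLevel  string
	Statistics map[string]string
}

// wordCollector stores entries keyed by their lowercased lemma.
type wordCollector struct {
	words map[string]*lexEntry
}

func newWordCollector() *wordCollector {
	return &wordCollector{words: make(map[string]*lexEntry)}
}

// add stores e only if it is graded B1 and its lemma is not yet present.
// It reports whether the entry was stored.
func (c *wordCollector) add(e lexEntry) bool {
	if !strings.EqualFold(e.CEFRLevel, "B1") {
		return false
	}
	key := strings.ToLower(strings.TrimSpace(e.Word))
	if key == "" {
		return false
	}
	if _, dup := c.words[key]; dup {
		return false
	}
	e.Word = key
	e.CEFRLevel = "B1"
	c.words[key] = &e
	return true
}

// size returns the number of stored entries.
func (c *wordCollector) size() int { return len(c.words) }

func main() {
	c := newWordCollector()
	fmt.Println(c.add(lexEntry{Word: "Huis", Tag: "noun", CEFRLevel: "B1"})) // stored
	fmt.Println(c.add(lexEntry{Word: "huis", Tag: "noun", CEFRLevel: "B1"})) // duplicate
	fmt.Println(c.add(lexEntry{Word: "kat", Tag: "noun", CEFRLevel: "A1"}))  // wrong level
	fmt.Println(c.size())
}
```

Keying on the lemma alone means a word that occurs with two parts of speech is stored once. A key that combines lemma and tag would avoid that, at the cost of a slightly more complex lookup.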
- AddWord(word string): Add a word if it's B1 level
- ParseCSVFile(filename string): Parse NT2Lex CSV data
- FilterByFrequency(minFreq int): Remove low-frequency words
- ExportToJSON(filename string): Export as JSON
- ExportToTextFile(filename string): Export as categorized text
- GenerateWordFrequencyReport(filename string): Create frequency-sorted report
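The frequency filter is the least obvious of these methods, because statistics are stored as strings. The sketch below shows one plausible approach. It assumes that the statistics key `f` holds a numeric frequency and that entries with a missing or non-numeric value should be dropped; both are assumptions for illustration, and `filterByFrequency` is a hypothetical helper rather than the program's method.

```go
package main

import (
	"fmt"
	"strconv"
)

// filterByFrequency deletes every word whose "f" statistic is below minFreq
// and returns the number of words removed.
// ASSUMPTION: "f" is the frequency key; missing or unparseable values are
// treated as below the threshold.
func filterByFrequency(stats map[string]map[string]string, minFreq int) int {
	removed := 0
	for word, s := range stats {
		freq, err := strconv.ParseFloat(s["f"], 64)
		if err != nil || freq < float64(minFreq) {
			delete(stats, word) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	// Made-up values for demonstration.
	stats := map[string]map[string]string{
		"huis":   {"f": "1250"},
		"zeldza": {"f": "12"},
		"kapot":  {}, // no frequency recorded
	}
	removed := filterByFrequency(stats, 100)
	fmt.Println("removed:", removed, "kept:", len(stats))
}
```

Parsing with `strconv.ParseFloat` rather than `strconv.Atoi` lets the same code handle normalized, non-integer frequencies should the data contain them.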
[
  {
    "word": "huis",
    "tag": "noun",
    "cefr_level": "B1",
    "statistics": {
      "f": "1250",
      "u": "850"
    }
  }
]
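A JSON layout like the one above can be produced with struct tags and `json.MarshalIndent` from the standard library. The following is a minimal sketch; `jsonEntry` and `encodeEntries` are hypothetical names, and the tags are inferred from the sample output rather than taken from the program's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonEntry carries struct tags that match the keys in the sample output.
type jsonEntry struct {
	Word       string            `json:"word"`
	Tag        string            `json:"tag"`
	CEFRLevel  string            `json:"cefr_level"`
	Statistics map[string]string `json:"statistics"`
}

// encodeEntries renders entries as indented JSON.
func encodeEntries(entries []jsonEntry) (string, error) {
	b, err := json.MarshalIndent(entries, "", "  ")
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeEntries([]jsonEntry{{
		Word:       "huis",
		Tag:        "noun",
		CEFRLevel:  "B1",
		Statistics: map[string]string{"f": "1250", "u": "850"},
	}})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

`encoding/json` writes map keys in sorted order, so the statistics appear in a stable order across runs. The resulting string could be written to disk with `os.WriteFile`.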
# Dutch B1 Vocabulary List
# Total words: 2847
## noun (1205 words)
auto
familie
geld
huis
school
tijd
werk
## verb (892 words)
beginnen
denken
eten
gaan
helpen
maken
## adjective (750 words)
goed
groot
klein
mooi
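The categorized layout above can be sketched as a group-then-sort step. The example below groups words by part of speech, orders the groups by size (largest first, as in the sample), and lists the words alphabetically within each group. `renderWordList` is a hypothetical helper, the ordering rule is inferred from the sample output, and the blank lines between groups are a choice made here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderWordList takes a word -> part-of-speech mapping and returns the
// categorized text listing.
func renderWordList(tagByWord map[string]string) string {
	// Group words by tag.
	groups := make(map[string][]string)
	for word, tag := range tagByWord {
		groups[tag] = append(groups[tag], word)
	}

	// Sort words within each group and collect the tag names.
	tags := make([]string, 0, len(groups))
	for tag := range groups {
		sort.Strings(groups[tag])
		tags = append(tags, tag)
	}

	// Largest group first; ties are broken alphabetically by tag.
	sort.Slice(tags, func(i, j int) bool {
		a, b := tags[i], tags[j]
		if len(groups[a]) != len(groups[b]) {
			return len(groups[a]) > len(groups[b])
		}
		return a < b
	})

	var sb strings.Builder
	fmt.Fprintf(&sb, "# Dutch B1 Vocabulary List\n# Total words: %d\n", len(tagByWord))
	for _, tag := range tags {
		fmt.Fprintf(&sb, "\n## %s (%d words)\n", tag, len(groups[tag]))
		for _, word := range groups[tag] {
			sb.WriteString(word + "\n")
		}
	}
	return sb.String()
}

func main() {
	fmt.Print(renderWordList(map[string]string{
		"huis": "noun",
		"auto": "noun",
		"gaan": "verb",
	}))
}
```

Sorting both the groups and their contents makes the output deterministic, which matters because Go randomizes map iteration order.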
Word      Frequency   Part of Speech
----      ---------   --------------
het       15420       determiner
een       12380       determiner
zijn      8940        verb
hebben    7530        verb
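A frequency-sorted report of this shape could be built by sorting rows in descending order of frequency and aligning the columns with `text/tabwriter`. This is a sketch under assumptions: `freqRow`, `sortByFrequency`, and `writeReport` are hypothetical names, frequencies are treated as integers, and the column padding is arbitrary.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
	"text/tabwriter"
)

// freqRow is one line of the report.
type freqRow struct {
	Word string
	Freq int
	Tag  string
}

// sortByFrequency returns a copy of rows ordered by descending frequency,
// with ties broken alphabetically by word. The input slice is not modified.
func sortByFrequency(rows []freqRow) []freqRow {
	sorted := append([]freqRow(nil), rows...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Freq != sorted[j].Freq {
			return sorted[i].Freq > sorted[j].Freq
		}
		return sorted[i].Word < sorted[j].Word
	})
	return sorted
}

// writeReport writes an aligned, frequency-sorted table to w.
func writeReport(w io.Writer, rows []freqRow) error {
	tw := tabwriter.NewWriter(w, 0, 8, 2, ' ', 0)
	fmt.Fprintln(tw, "Word\tFrequency\tPart of Speech")
	fmt.Fprintln(tw, "----\t---------\t--------------")
	for _, r := range sortByFrequency(rows) {
		fmt.Fprintf(tw, "%s\t%d\t%s\n", r.Word, r.Freq, r.Tag)
	}
	return tw.Flush() // alignment is computed when the writer is flushed
}

func main() {
	rows := []freqRow{
		{Word: "zijn", Freq: 8940, Tag: "verb"},
		{Word: "het", Freq: 15420, Tag: "determiner"},
		{Word: "hebben", Freq: 7530, Tag: "verb"},
		{Word: "een", Freq: 12380, Tag: "determiner"},
	}
	if err := writeReport(os.Stdout, rows); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
	}
}
```

Writing to an `io.Writer` rather than directly to a file keeps the function easy to test; a file opened with `os.Create` can be passed in the same way as `os.Stdout`.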
- Repository: github.com/anaistack/NT2Lex
- Online Tools: cental.uclouvain.be/nt2lex/
- Paper: Tack et al. (2018) - NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 137-146.
- Dutch coursebook vocabulary lists
- Frequency dictionaries
- Common topic-based word lists (housing, food, transport, etc.)
NT2Lex is based on academic research that analyzed:
- Dutch textbook reading activities
- Simplified readers for Dutch learners
- Frequency distributions across CEFR levels
- Linguistic complexity metrics
The B1 level words represent vocabulary that intermediate Dutch learners should know for:
- Expressing opinions and experiences
- Describing plans and ambitions
- Handling most everyday situations
- Understanding main points of clear texts
package main

import "fmt"

func main() {
	// Create collector
	collector := NewB1WordCollector()
	parser := NewNT2LexParser(collector)

	// Parse NT2Lex data
	if err := parser.ParseCSVFile("nt2lex_data.csv"); err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}

	// Filter by frequency (optional)
	parser.FilterByFrequency(100) // Keep words with freq >= 100

	// Generate reports
	collector.ExportToJSON("b1_words.json")
	collector.ExportToTextFile("b1_words.txt")
	parser.GenerateWordFrequencyReport("frequency_report.txt")

	// Display statistics
	collector.PrintStats()
}
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
The NT2Lex data is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Tack, A., François, T., Desmet, P., & Fairon, C. (2018). NT2Lex: A CEFR-Graded Lexical Resource for Dutch as a Foreign Language Linked to Open Dutch WordNet. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 137-146.