okasi/swedish-pii

🇸🇪 Swedish PII Detection

Detects Personally Identifiable Information (PII) in Swedish text


📦 Current Version

🔍 What Can Be Detected?

💳 Financial
  • American Express credit card numbers
  • Mastercard credit card numbers
  • Visa credit card numbers
  • Swedish IBAN codes
  • Swedish BIC codes
  • Swedish bank account numbers
🆔 Identification Numbers
  • Swedish personal identity numbers (male)
  • Swedish personal identity numbers (female)
  • Swedish coordination numbers (male)
  • Swedish coordination numbers (female)
📞 Contact
  • Email addresses
  • Phone numbers
  • Social media information
📍 Location
  • Swedish street addresses
  • Swedish postal codes
  • Swedish municipalities
  • Swedish counties
🏢 Work / Education
  • Swedish work organizations
  • Swedish education organizations
  • Swedish education programs
  • Swedish work professions
  • Swedish organization numbers
🔒 Sensitive Attributes
  • Marital status information
  • Genetic sex information
  • Disability information
  • Religion information
  • Sexual orientation information
  • Demographic information
  • Political ideology information
🧩 Misc
  • Swedish license plate information
  • IP addresses
  • MAC addresses
  • Date information
  • Time information
👤 Names
  • Top 20,524 male first names with at least 10 bearers in Sweden 1999-2020
  • Top 23,347 female first names with at least 10 bearers in Sweden 1999-2020
  • Top 107,762 last names with at least 10 bearers in Sweden 1999-2020
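
Several of the numeric identifiers listed above carry a check digit, so a pattern match can be followed by a checksum test to cut false positives. Card numbers and the 10-digit form of Swedish personal identity numbers both use the Luhn algorithm, and in a personal identity number the ninth digit is odd for men and even for women. The sketch below is only an illustration in POSIX shell: the function names `luhn_valid` and `pnr_sex` are hypothetical, and nothing here is taken from this repository's actual implementation.

```shell
# luhn_valid DIGITS
# Exit status 0 if DIGITS (digits only) passes the Luhn check, 1 otherwise.
luhn_valid() {
  digits=$1
  # Reject empty input and anything that is not purely digits.
  case $digits in
    ''|*[!0-9]*) return 1 ;;
  esac
  sum=0
  double=0                      # 0 = add digit as is, 1 = double it first
  while [ -n "$digits" ]; do
    rest=${digits%?}            # everything except the last character
    d=${digits#"$rest"}         # the last character
    digits=$rest
    if [ "$double" -eq 1 ]; then
      d=$((d * 2))
      [ "$d" -gt 9 ] && d=$((d - 9))
    fi
    sum=$((sum + d))
    double=$((1 - double))
  done
  [ $((sum % 10)) -eq 0 ]
}

# pnr_sex YYMMDD-NNNC
# Prints "male" or "female" for a 10-digit personal identity number whose
# check digit is valid; returns 1 otherwise. Hypothetical helper: it does
# not validate the date part and does not handle the 12-digit form.
pnr_sex() {
  compact=$(printf '%s' "$1" | tr -d '+-')   # drop the separator
  case $compact in
    [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
    *) return 1 ;;
  esac
  luhn_valid "$compact" || return 1
  ninth=${compact%?}            # first nine digits
  ninth=${ninth#????????}       # keep only the ninth
  if [ $((ninth % 2)) -eq 1 ]; then echo male; else echo female; fi
}
```

For example, `luhn_valid 4111111111111111 && echo ok` accepts the widely used Visa test number, and `pnr_sex 811218-9876` classifies that sample number by its ninth digit. Coordination numbers use the same check digit with 60 added to the day of birth, so the same two helpers apply to them unchanged.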

🛣️ Roadmap

  • Patterns for:
    • Passport numbers
    • Residence Permit Number?
    • Bank Account Number (Bankgiro/Plusgiro)
  • Set lookups for:
  • Reduce false positives
  • Improve performance & simplify
  • Comprehensive tests
  • Publish as an npm package (library)
  • Build a documentation page (frontend)

📚 Dataset Sources


🗺️ Extracting Data from OpenStreetMap via Osmium

🛠️ Install Tools via Homebrew

brew install osmium-tool
brew install jq
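
The extraction commands in this section read a file named `sweden-latest.osm.pbf`, but the README does not say where it comes from. The file name matches Geofabrik's regional extract for Sweden, so the sketch below assumes that source; the URL, the variable names, and the `fetch_pbf` helper are assumptions, not something stated in this repository.

```shell
# Assumed source: Geofabrik's regional extract for Sweden (not confirmed by this README).
PBF_URL="https://download.geofabrik.de/europe/sweden-latest.osm.pbf"
PBF_FILE="${PBF_URL##*/}"       # file name part of the URL

# fetch_pbf: download the extract unless it is already present.
fetch_pbf() {
  [ -f "$PBF_FILE" ] || curl -fL -o "$PBF_FILE" "$PBF_URL"
}

# Usage (large download, so it is not run automatically here):
#   fetch_pbf && osmium fileinfo "$PBF_FILE"
```

Running `osmium fileinfo` on the downloaded file is a quick way to confirm that the extract is readable before starting the longer filter steps.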

🏞️ Extract Counties

osmium tags-filter sweden-latest.osm.pbf "nwr/admin_level=4" -o counties.osm.pbf && \
osmium export counties.osm.pbf -o counties.geojson && \
jq -r '
  [
    .features[]
    | select(
        .properties.admin_level == "4"
        and .properties.name != null
        and (.properties.name | test("län$"))
      )
    | .properties.name
  ]
  | sort
  | unique
' counties.geojson > counties.json && \
rm counties.osm.pbf counties.geojson

🏙️ Extract Cities

osmium tags-filter sweden-latest.osm.pbf "nwr/place=city" -o cities.osm.pbf && \
osmium export cities.osm.pbf -o cities.geojson && \
jq -r '
  [
    .features[]
    | select(
        .properties.name != null
        and (.properties.name | test("^[0-9]+$") | not)
        and (.properties.name | test("vägen$|kommun$") | not)
      )
    | .properties.name
  ]
  | sort
  | unique
' cities.geojson > cities.json && \
rm cities.osm.pbf cities.geojson

🏛️ Extract Municipalities

osmium tags-filter sweden-latest.osm.pbf nwr/boundary=administrative -o municipalities.osm.pbf && \
osmium export municipalities.osm.pbf -o municipalities.geojson && \
jq -r '
  .features
  | map(select(.properties.admin_level == "7" and .properties.name != null and .properties.name != "Svartån"))
  | map(.properties.name)
  | map(if endswith(" kommun") or . == "Göteborgs Stad" then . else . + " kommun" end)
  | sort
  | unique
' municipalities.geojson > municipalities.json && \
rm municipalities.osm.pbf municipalities.geojson

🏘️ Extract Suburbs, Neighborhoods & Towns

osmium tags-filter sweden-latest.osm.pbf "nwr/place=suburb" -o suburbs.osm.pbf && \
osmium tags-filter sweden-latest.osm.pbf "nwr/place=neighbourhood" -o neighborhoods.osm.pbf && \
osmium tags-filter sweden-latest.osm.pbf "nwr/place=town" -o towns.osm.pbf && \
osmium merge suburbs.osm.pbf neighborhoods.osm.pbf towns.osm.pbf -o areas.osm.pbf && \
osmium export areas.osm.pbf -o areas.geojson && \
jq -r '
  [
    .features[]
    | select(
        .properties.name != null
        and (.properties.name | test("^[0-9]+$") | not)
      )
    | .properties.name
  ]
  | sort
  | unique
' areas.geojson > areas.json && \
rm suburbs.osm.pbf neighborhoods.osm.pbf towns.osm.pbf areas.osm.pbf areas.geojson

🚦 Extract Streets & Postal Codes

osmium tags-filter sweden-latest.osm.pbf 'addr:street=*' 'addr:postcode=*' -o addresses.osm.pbf && \
osmium export addresses.osm.pbf -o addresses.geojson && \
jq -r '
  .features
  | map(
      select(
        (.properties["addr:street"] != null) and
        (.properties["addr:postcode"] != null) and
        (.properties["addr:street"] | test("^[0-9]+$") | not)
      )
    )
  | map({
      street: .properties["addr:street"],
      postcode: (
        .properties["addr:postcode"]
        | gsub(" "; "")
        | if test("^[0-9]{5}$") then .[:3] + " " + .[3:] else . end
      ),
      housenumber: (
        (.properties["addr:housenumber"] // "") 
        | split(",")
        | map(gsub("^ +| +$"; ""))
      )
    })
  | group_by(.street)
  | map(
      {
        street: .[0].street,
        postalcodes: (map(.postcode) | unique),
        housenumbers: (map(.housenumber) | add | unique)
      }
    )
  | sort_by(.street)
' addresses.geojson > streets_postalcodes.json && \
rm addresses.osm.pbf addresses.geojson
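
The jq filter above stores five-digit postal codes in the `NNN NN` form and leaves anything else untouched apart from removing spaces. Text being scanned for PII may contain either spelling, so applying the same normalization before a lookup keeps both sides comparable. A minimal POSIX shell sketch of that rule follows; `normalize_postcode` is a hypothetical helper, not part of this repository.

```shell
# normalize_postcode RAW
# Mirrors the jq rule: strip spaces, then print five-digit codes as "NNN NN".
# Any other value is printed with spaces removed and otherwise unchanged.
normalize_postcode() {
  compact=$(printf '%s' "$1" | tr -d ' ')
  case $compact in
    [0-9][0-9][0-9][0-9][0-9])
      # First three digits, a space, then the last two digits.
      printf '%s %s\n' "${compact%??}" "${compact#???}" ;;
    *)
      printf '%s\n' "$compact" ;;
  esac
}
```

With this helper, both `normalize_postcode 11122` and `normalize_postcode "111 22"` print `111 22`.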

🧪 RegEx Ideas
