Skip to content

Generate long-format CSV for Upset-style Tableau visualization for Data Quality Dashbaord #444

@jacobthill

Description

@jacobthill

Problem

Our current data file, source-counts.csv, uses a wide format with pipe-separated source combinations (e.g., Dimensions|Openalex) and a count of DOIs. While this structure is sufficient for simple tabular views, we also need the same data in long format to generate a different portion of the Upset-style visualization

Proposal

Generate a new long-format CSV (source-counts-long-format.csv) derived from source-counts.csv that will support Tableau plotting. Each row in the new file should correspond to a single source in a given source combination.


Input: source-counts.csv

sources,count
Dimensions,18402
Dimensions|Openalex,228950
Dimensions|Openalex|PubMed,932
...

Output: source-counts-long-format.csv

group_id,source,position,count,group_position
1,Dimensions,1,18402,1
2,Dimensions,1,228950,2
2,Openalex,2,228950,2
3,Dimensions,1,932,3
3,Openalex,2,932,3
3,PubMed,3,932,3
...

Workflow Steps

  1. Assign a unique group_id

    • Each row in the original file gets a unique ID (e.g., 1–31) to represent a single combination of sources.
  2. Split the sources column

    • Use the pipe | delimiter to extract individual sources (e.g., ["Dimensions", "Openalex"]).
  3. Define a fixed horizontal source order

    • Assign a consistent numeric position to each source name:
      • Dimensions = 1
      • Openalex = 2
      • PubMed = 3
      • SUL-Pub = 4
      • WoS = 5
  4. Expand into long format

    • For each source in a given group:
      • Create a new row with:
        • group_id: from Step 1
        • source: individual source
        • position: based on fixed order
        • count: inherited from the original group
        • group_position: same as group_id, used for row alignment in Tableau
  5. Output the result to a new CSV

    • Columns should be in this order:
      group_id,source,position,count,group_position
      

Here is a link to the actual file needed: https://drive.google.com/file/d/13SDgb2cF7uEtSFr9dCACep5SRHVVDIyh/view?usp=drive_link

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions