Description
Problem
Our current data file, source-counts.csv, uses a wide format with pipe-separated source combinations (e.g., Dimensions|Openalex) and a count of DOIs. While this structure is sufficient for simple tabular views, we also need the same data in long format to generate a different portion of the UpSet-style visualization.
Proposal
Generate a new long-format CSV (source-counts-long-format.csv) derived from source-counts.csv that will support Tableau plotting. Each row in the new file should correspond to a single source in a given source combination.
Input: source-counts.csv
sources,count
Dimensions,18402
Dimensions|Openalex,228950
Dimensions|Openalex|PubMed,932
...
Output: source-counts-long-format.csv
group_id,source,position,count,group_position
1,Dimensions,1,18402,1
2,Dimensions,1,228950,2
2,Openalex,2,228950,2
3,Dimensions,1,932,3
3,Openalex,2,932,3
3,PubMed,3,932,3
...
Workflow Steps
1. Assign a unique group_id
   - Each row in the original file gets a unique ID (e.g., 1–31) to represent a single combination of sources.
2. Split the sources column
   - Use the pipe | delimiter to extract individual sources (e.g., ["Dimensions", "Openalex"]).
3. Define a fixed horizontal source order
   - Assign a consistent numeric position to each source name:
     - Dimensions = 1
     - Openalex = 2
     - PubMed = 3
     - SUL-Pub = 4
     - WoS = 5
4. Expand into long format
   - For each source in a given group, create a new row with:
     - group_id: from Step 1
     - source: individual source
     - position: based on fixed order
     - count: inherited from the original group
     - group_position: same as group_id, used for row alignment in Tableau
5. Output the result to a new CSV
   - Columns should be in this order: group_id,source,position,count,group_position
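The workflow steps above could be sketched roughly as follows, using only Python's standard csv module. This is a minimal sketch, not a settled implementation: the names SOURCE_POSITION, to_long_rows, and convert are hypothetical, and it assumes the input has exactly the sources and count columns shown in the Input sample and that every source name appears in the fixed order list (an unknown name raises KeyError).

```python
import csv
import io

# Fixed horizontal order taken from the workflow steps in this issue.
# Assumption: no other source names occur; an unlisted name raises KeyError.
SOURCE_POSITION = {"Dimensions": 1, "Openalex": 2, "PubMed": 3, "SUL-Pub": 4, "WoS": 5}
FIELDNAMES = ["group_id", "source", "position", "count", "group_position"]


def to_long_rows(wide_rows):
    """Yield one long-format row (a dict) per source in each wide-format row."""
    # group_id is the 1-based row number of the combination in the input file.
    for group_id, row in enumerate(wide_rows, start=1):
        for source in row["sources"].split("|"):
            source = source.strip()
            yield {
                "group_id": group_id,
                "source": source,
                "position": SOURCE_POSITION[source],
                "count": int(row["count"]),
                "group_position": group_id,  # same as group_id, for row alignment
            }


def convert(in_file, out_file):
    """Read wide-format CSV from in_file and write long-format CSV to out_file."""
    reader = csv.DictReader(in_file)
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(to_long_rows(reader))


# Small in-memory demo using the three sample rows from the Input section.
sample = (
    "sources,count\n"
    "Dimensions,18402\n"
    "Dimensions|Openalex,228950\n"
    "Dimensions|Openalex|PubMed,932\n"
)
demo_out = io.StringIO()
convert(io.StringIO(sample), demo_out)
print(demo_out.getvalue())
```

To run it against the real files, open source-counts.csv for reading and source-counts-long-format.csv for writing (both with newline="") and pass the two file objects to convert.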
Here is a link to the actual file needed: https://drive.google.com/file/d/13SDgb2cF7uEtSFr9dCACep5SRHVVDIyh/view?usp=drive_link