Commit d34a122

v0.9.0 (#27)
* Empty json file for testing
* Added notes for large pickles #21
* Delete requirements.txt (#26); setup.py is sufficient; completes #25
* Corrected URL for word cloud & updated FDG section in README.md
1 parent b8a511a commit d34a122

10 files changed

+40
-44
lines changed

README.md

Lines changed: 29 additions & 2 deletions
@@ -27,11 +27,11 @@ The score here is derived from the term [tf-idf](https://en.wikipedia.org/wiki/T
2727

2828
### Word cloud
2929

30-
Here is a [wordcloud](https://github.com/datasciencecampus/patent_app_detect/output/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
30+
Here is a [wordcloud](https://raw.githubusercontent.com/datasciencecampus/patent_app_detect/master/outputs/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
3131

3232
### Force directed graph
3333

34-
This output provides an [interactive graph](https://github.com/datasciencecampus/patent_app_detect/outputs/fdg/index.html) that shows connections between terms that are generally found in the same patent documents. This example was run for the Y02 classification on a 10,000 random sample of patents.
34+
This output provides an interactive graph to be viewed in a web browser (you need to open the file ```outputs/fdg/index.html``` locally). The graph shows connections between terms that are generally found in the same patent documents. The example graph in the ```outputs/fdg``` folder was created using the Y02 classification on a 10,000 random sample of patents.
3535

3636
## How to install
3737

@@ -82,6 +82,27 @@ python detect.py -ps=USPTO-random-10000
8282

8383
Will run the tool for a pre-created random dataset of 10,000 patents.
8484

85+
### Additional patent sources
86+
87+
Patent datasets are stored in the sub-folder ```data```; we have supplied the following files:
88+
- ```USPTO-random-100.pkl.bz2```
89+
- ```USPTO-random-1000.pkl.bz2```
90+
- ```USPTO-random-10000.pkl.bz2```
91+
- ```USPTO-random-100000.pkl.bz2```
92+
- ```USPTO-random-500000.pkl.bz2```
93+
94+
The command ```python detect.py -ps=USPTO-random-10000``` instructs the program to load a pickled data frame of patents
95+
from a file located in ```data/USPTO-random-10000.pkl.bz2```. Hence ```-ps=NAME``` looks for ```data/NAME.pkl.bz2```.
96+
97+
We have hosted larger datasets on Google Drive, as the files are too large for GitHub version control. We have made available:
98+
- All USPTO patents from 2004 (477 MB): [USPTO-all.pkl.bz2](https://drive.google.com/drive/folders/1d47pizWdKqtORS1zoBzsk3tLk6VZZA4N)
99+
100+
To use additional files, follow the link and download the pickle file into the data folder. Access the new data
101+
with ```-ps=NameWithoutFileExtension```; for example, ```USPTO-all.pkl.bz2``` would be loaded with ```-ps=USPTO-all```.
102+
103+
Note that large datasets require a large amount of system memory (such as 64 GB); otherwise processing will be very slow,
104+
as virtual memory (swap) is very likely to be used.
105+
85106
### Choosing CPC classification
86107

87108
This subsets the chosen patents dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". In this case a larger patent dataset is generally required to allow for the reduction in patent numbers after subsetting. An example script is:
@@ -216,3 +237,9 @@ optional arguments:
216237
the desired cpc classification
217238
218239
```
240+
241+
## Acknowledgements
242+
243+
### Patent data
244+
245+
Patent data was obtained from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov) through the [Bulk Data Storage System (BDSS)](https://bulkdata.uspto.gov). In particular we used the `Patent Grant Full Text Data/APS (JAN 1976 - PRESENT)` dataset, using the data from 2004 onwards in XML 4.* format.
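
The `-ps=NAME` convention described in the README hunk above (`-ps=NAME` looks for `data/NAME.pkl.bz2`) could be sketched roughly as follows. This is a minimal illustration and not the code in `detect.py`: the helper names `dataset_path` and `load_dataset` are hypothetical, and the demo round-trips a plain dictionary rather than a real patent data frame.

```python
import bz2
import os
import pickle
import tempfile


def dataset_path(name, data_dir="data"):
    # Hypothetical helper: -ps=NAME resolves to <data_dir>/NAME.pkl.bz2
    return os.path.join(data_dir, name + ".pkl.bz2")


def load_dataset(name, data_dir="data"):
    # Open the bz2-compressed file and unpickle whatever object it holds
    with bz2.open(dataset_path(name, data_dir), "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Round-trip a stand-in object through a temporary "data" folder
    with tempfile.TemporaryDirectory() as tmp:
        with bz2.open(dataset_path("demo", tmp), "wb") as handle:
            pickle.dump({"abstract": ["example text"]}, handle)
        print(load_dataset("demo", tmp))
```

Loading a real dataset this way would still need pandas installed, since the supplied files are described as pickled data frames. Only unpickle files from sources you trust.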

detect.py

Lines changed: 1 addition & 1 deletion
@@ -109,7 +109,7 @@ def get_tfidf(args, filename, cpc):
109109

110110

111111
def main():
112-
paths = [os.path.join('outputs', 'reports'), os.path.join('outputs', 'json'), os.path.join('outputs', 'wordclouds')]
112+
paths = [os.path.join('outputs', 'reports'), os.path.join('outputs', 'wordclouds')]
113113
for path in paths:
114114
os.makedirs(path, exist_ok=True)
115115
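
The effect of this hunk, where `outputs/json` is dropped from the list of folders created at start-up, can be tried in isolation with a sketch like the one below. The function name `ensure_output_dirs` and its `base` parameter are assumptions added for illustration; `detect.py` builds the paths inline inside `main()`.

```python
import os
import tempfile


def ensure_output_dirs(base, subfolders=("reports", "wordclouds")):
    # Build outputs/<subfolder> under a base directory (hypothetical parameter)
    paths = [os.path.join(base, "outputs", sub) for sub in subfolders]
    for path in paths:
        # exist_ok=True makes repeated runs safe: no error if the folder exists
        os.makedirs(path, exist_ok=True)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ensure_output_dirs(tmp)
        ensure_output_dirs(tmp)  # second call does not raise
        print(sorted(os.listdir(os.path.join(tmp, "outputs"))))
```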

outputs/fdg/empty.json

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
[]

outputs/fdg/f.js

Lines changed: 1 addition & 26 deletions
@@ -1,10 +1,7 @@
1-
var dataURL = "http://mysafeinfo.com/api/data?list=englishmonarchs&format=json";
2-
1+
var dataURL = "https://raw.githubusercontent.com/datasciencecampus/patent_app_detect/master/outputs/fdg/empty.json";
32

43
var refresh = function(data){
54

6-
7-
85
var json_obj = JSON.parse(data);
96
var svg = d3.select("svg"),
107
width = +svg.attr("width"),
@@ -25,40 +22,20 @@ d3.select("div#chartId")
2522
//class to make it responsive
2623
.classed("svg-content-responsive", true);
2724

28-
//var container = d3.select('body').append('div')
29-
// .attr('id','container')
30-
//;
31-
//
32-
//// svg#sky
33-
//var sky = container.append('svg')
34-
// //.attr('height', 100)
35-
// //.attr('width', 100)
36-
// .attr('id', 'sky')
37-
//;
38-
3925
var color = d3.scaleOrdinal(d3.schemeCategory20c);
40-
//var nodeRadius = 20;
4126

4227
var padding = 1, // separation between circles
4328
radius=6;
4429

45-
46-
4730
var simulation = d3.forceSimulation()
4831
.force("link", d3.forceLink().id(function(d) {
4932
return d.text;
5033
}).distance(300))
5134
.force("charge", d3.forceManyBody().strength(-100))
5235
.force("center", d3.forceCenter(width / 2, height / 2))
53-
//.force("gravity", 0.05)
54-
//.force("linkDistance", 50)
55-
//.force("size", [9000, 6000])
5636
.force("collide", d3.forceCollide().radius(function(d) {
5737
return 12*radius + padding; }).iterations(40))
5838

59-
60-
61-
6239
d3.json(dataURL, function(error, graph) {
6340
if (error) throw error;
6441

@@ -72,7 +49,6 @@ d3.json(dataURL, function(error, graph) {
7249
.data(graph.links)
7350
.enter().append("line").attr("stroke-width", function(d) {
7451
return (8*d.size);
75-
//Math.sqrt(1.5*d.size);
7652
});
7753

7854
var node = svg.append("g")
@@ -119,7 +95,6 @@ d3.json(dataURL, function(error, graph) {
11995
.text(function(d) {
12096
return d.text
12197
});
122-
12398

12499
simulation
125100
.nodes(graph.nodes)
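
The accessors in `f.js` (`graph.nodes`, `graph.links`, `d.text` as the node id, and `d.size` for the stroke width) suggest the shape of the graph data the script expects. The Python sketch below builds a structure of that shape. It is an inference from the accessors shown in this diff only: the real generator and the contents of `key-terms.js` are not visible here, and `build_graph` and its input format are hypothetical.

```python
import json


def build_graph(term_pairs):
    # term_pairs: iterable of (term_a, term_b, weight) tuples (assumed input)
    nodes = {}
    links = []
    for term_a, term_b, weight in term_pairs:
        # Each node carries a "text" field, which f.js uses as the link id
        nodes.setdefault(term_a, {"text": term_a})
        nodes.setdefault(term_b, {"text": term_b})
        # "size" scales the line width in f.js (stroke-width = 8 * size)
        links.append({"source": term_a, "target": term_b, "size": weight})
    return {"nodes": list(nodes.values()), "links": links}


if __name__ == "__main__":
    graph = build_graph([("solar", "panel", 0.5), ("solar", "cell", 0.25)])
    print(json.dumps(graph, indent=2))
```

Using the term text as both the node id and the link endpoint matches the `d3.forceLink().id(...)` call in the script, which resolves `source` and `target` values against `d.text`.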

outputs/fdg/index.html

Lines changed: 1 addition & 1 deletion
@@ -4,5 +4,5 @@
44
<link rel="stylesheet" href="fdg_style.css"/>
55
<script src="https://d3js.org/d3.v4.min.js"></script>
66
<script src="knockout-3.4.2.js"></script>
7-
<script type="text/javascript" src="key-terms.json"></script>
7+
<script src="key-terms.js"></script>
88
<script src="f.js"></script>

outputs/fdg/key-terms.js

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

outputs/fdg/key-terms.json

Lines changed: 0 additions & 1 deletion
This file was deleted.

requirements.txt

Lines changed: 0 additions & 8 deletions
This file was deleted.
