* Empty json file for testing
* Added notes for large pickles #21
* Delete requirements.txt (#26); setup.py is sufficient; completes #25
* Corrected URL for word cloud & updated FDG section in README.md
README.md: 29 additions & 2 deletions
@@ -27,11 +27,11 @@ The score here is derived from the term [tf-idf](https://en.wikipedia.org/wiki/T
### Word cloud
-Here is a [wordcloud](https://github.com/datasciencecampus/patent_app_detect/output/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
+Here is a [wordcloud](https://raw.githubusercontent.com/datasciencecampus/patent_app_detect/master/outputs/wordclouds/wordcloud_tech.png) using the Y02 classification on a 10,000 random sample of patents. The greater the tf-idf score, the larger the font size of the term.
### Force directed graph
-This output provides an [interactive graph](https://github.com/datasciencecampus/patent_app_detect/outputs/fdg/index.html) that shows connections between terms that are generally found in the same patent documents. This example was run for the Y02 classification on a 10,000 random sample of patents.
+This output provides an interactive graph to be viewed in a web browser (open the file ```outputs/fdg/index.html``` locally). The graph shows connections between terms that are generally found in the same patent documents. The example graph in the ```outputs/fdg``` folder was created using the Y02 classification on a 10,000 random sample of patents.
Will run the tool for a pre-created random dataset of 10,000 patents.
+### Additional patent sources
+
+Patent datasets are stored in the ```data``` sub-folder. We have supplied the following files:
+- ```USPTO-random-100.pkl.bz2```
+- ```USPTO-random-1000.pkl.bz2```
+- ```USPTO-random-10000.pkl.bz2```
+- ```USPTO-random-100000.pkl.bz2```
+- ```USPTO-random-500000.pkl.bz2```
+
+The command ```python detect.py -ps=USPTO-random-10000``` instructs the program to load a pickled data frame of patents
+from a file located in ```data/USPTO-random-10000.pkl.bz2```. Hence ```-ps=NAME``` looks for ```data/NAME.pkl.bz2```.
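As a rough illustration of the ```-ps=NAME``` convention, the sketch below maps a name to its file under ```data``` and unpickles it. The helper names ```resolve_patent_source``` and ```load_patent_source``` are hypothetical and are not taken from ```detect.py```, which may load the file differently. This version uses only the Python standard library.

```python
import bz2
import pickle
from pathlib import Path


def resolve_patent_source(name, data_dir="data"):
    """Map a -ps=NAME value to data/NAME.pkl.bz2 (hypothetical helper)."""
    return Path(data_dir) / f"{name}.pkl.bz2"


def load_patent_source(name, data_dir="data"):
    """Open the bz2-compressed pickle and return the unpickled object."""
    path = resolve_patent_source(name, data_dir)
    with bz2.open(path, "rb") as handle:
        # Unpickling a data frame requires pandas to be installed.
        return pickle.load(handle)
```

For example, ```load_patent_source("USPTO-random-10000")``` would read ```data/USPTO-random-10000.pkl.bz2```. Pickle files can execute arbitrary code when loaded, so only load files from a source you trust.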
+
+We have hosted larger datasets on Google Drive, as the files are too large for GitHub version control. We have made available:
+- All USPTO patents from 2004 onwards (477 MB): [USPTO-all.pkl.bz2](https://drive.google.com/drive/folders/1d47pizWdKqtORS1zoBzsk3tLk6VZZA4N)
+
+To use additional files, follow the link and download the pickle file into the ```data``` folder. Access the new data
+with ```-ps=NameWithoutFileExtension```; for example, ```USPTO-all.pkl.bz2``` would be loaded with ```-ps=USPTO-all```.
+
+Note that large datasets require a large amount of system memory (such as 64 GB); otherwise processing will be very slow,
+as virtual memory (swap) is very likely to be used.
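One way to gauge a dataset before loading it is to measure the size of the decompressed pickle by streaming it in chunks, which avoids holding the whole file in memory. This is only a rough indicator, under the assumption that the in-memory data frame is at least comparable in size to its pickled form. The function name ```uncompressed_size``` is hypothetical and not part of this tool.

```python
import bz2


def uncompressed_size(path, chunk_size=1 << 20):
    """Return the decompressed size in bytes of a .bz2 file (hypothetical helper)."""
    total = 0
    with bz2.open(path, "rb") as handle:
        while True:
            # Read a bounded chunk so memory use stays small.
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

Comparing the result for, say, ```data/USPTO-all.pkl.bz2``` against the memory available on the machine gives an early hint of whether swap is likely to be used.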
+
### Choosing CPC classification
This subsets the chosen patents dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". In this case a larger patent dataset is generally required to allow for the reduction in patent numbers after subsetting. An example script is:
@@ -216,3 +237,9 @@ optional arguments:
the desired cpc classification
```
+
+## Acknowledgements
+
+### Patent data
+
+Patent data was obtained from the [United States Patent and Trademark Office (USPTO)](https://www.uspto.gov) through the [Bulk Data Storage System (BDSS)](https://bulkdata.uspto.gov). In particular we used the `Patent Grant Full Text Data/APS (JAN 1976 - PRESENT)` dataset, using the data from 2004 onwards in XML 4.* format.