
Commit b8a511a

First release attempt (#18)
v0.1 pre-release

* Minimal code (#2)
* Setting up github... added requirements.txt to enable dependency tree
* First CI test (#3)
* Minimal set of files to get tests passing #1
* Config to trigger travis
* Remaining code (#7)
* Uses setup.py (#10)
* Corrected license
* bug: backend matplotlib so that it works with PyCharm. Fixes issue #12. (#13)
* feat: now shows the number of patents analysed for cpc classification
* feat: updated README. Uploaded outputs for README. Also moved fdg outputs to outputs/fdg folder, not fdg folder in root directory (cleaner)
* Experimenting with code coverage #9 (#17)
1 parent 378c8cd commit b8a511a


51 files changed (+3233, -4 lines)

.coveragerc

Lines changed: 13 additions & 0 deletions

```ini
[run]
branch = True
source = scripts

[report]
exclude_lines =
    if self.debug:
    pragma: no cover
    raise NotImplementedError
    if __name__ == .__main__.:
ignore_errors = True
omit =
    tests/*
```
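The `exclude_lines` entries are regular expressions, which is why the main guard appears as `if __name__ == .__main__.:` rather than with quotes: the unescaped dots match either quote character, so both single- and double-quoted guards are excluded from coverage. A quick check of that behaviour:

```python
import re

# The dots in the exclude pattern match ' or ", so both quoting
# styles of the main guard are matched by the same pattern.
pattern = re.compile(r"if __name__ == .__main__.:")

print(bool(pattern.search("if __name__ == '__main__':")))  # True
print(bool(pattern.search('if __name__ == "__main__":')))  # True
```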

.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 35 additions & 0 deletions

```markdown
---
name: Bug report
about: Create a report to help us improve

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
 - OS: [e.g. iOS]
 - Browser [e.g. chrome, safari]
 - Version [e.g. 22]

**Smartphone (please complete the following information):**
 - Device: [e.g. iPhone6]
 - OS: [e.g. iOS8.1]
 - Browser [e.g. stock browser, safari]
 - Version [e.g. 22]

**Additional context**
Add any other context about the problem here.
```
.github/ISSUE_TEMPLATE/feature_request.md

Lines changed: 17 additions & 0 deletions

```markdown
---
name: Feature request
about: Suggest an idea for this project

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
```

.gitignore

Lines changed: 3 additions & 0 deletions

```diff
@@ -102,3 +102,6 @@ venv.bak/

 # mypy
 .mypy_cache/
+
+# PyCharm
+.idea
```

.travis.yml

Lines changed: 18 additions & 0 deletions

```yaml
language: python
python:
  - "3.6"

install:
  # command to install dependencies
  - python setup.py install
  # also need to download punkt tokeniser data
  - travis_wait 30 python -m nltk.downloader punkt

script:
  # for codecov support
  - pip install pytest pytest-cov
  # command to run tests
  - pytest --cov=./

after_success:
  - bash <(curl -s https://codecov.io/bash)
```

LICENSE

Lines changed: 8 additions & 0 deletions

```
The Open Government Licence (OGL) Version 3

Copyright (c) 2018 Office for National Statistics

This source code is licensed under the Open Government Licence v3.0. To view this
licence, visit www.nationalarchives.gov.uk/doc/open-government-licence/version/3
or write to the Information Policy Team, The National Archives, Kew, Richmond,
Surrey, TW9 4DU.
```

README.md

Lines changed: 218 additions & 2 deletions

[![build status](http://img.shields.io/travis/datasciencecampus/patent_app_detect/master.svg?style=flat)](https://travis-ci.org/datasciencecampus/patent_app_detect)
[![codecov](https://codecov.io/gh/datasciencecampus/patent_app_detect/branch/master/graph/badge.svg)](https://codecov.io/gh/datasciencecampus/patent_app_detect)
[![LICENSE.](https://img.shields.io/badge/license-OGL--3-blue.svg?style=flat)](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)

# patent_app_detect

## Description of tool

The tool derives popular terminology used within a particular patent technology area ([CPC classification](https://www.epo.org/searching-for-patents/helpful-resources/first-time-here/classification/cpc.html)), based on text analysis of patent abstract information. If the tool is targeted at the [Y02 classification](https://www.epo.org/news-issues/issues/classification/classification.html), for example, identified terms could include 'fuel cell' and 'heat exchanger'. A number of output options are provided: a report, a word cloud, or a graphical output. Some example outputs are shown below:

### Report

The score here is derived from the term [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) values, using the Y02 classification on a random sample of 10,000 patents. The terms in this example are all bigrams.

| Term | TF-IDF Score |
| :------------------------ | -------------------:|
| 1. fuel cell | 2.143778 |
| 2. heat exchanger | 1.697166 |
| 3. exhaust gas | 1.496812 |
| 4. combustion engine | 1.480615 |
| 5. combustion chamber | 1.390726 |
| 6. energy storage | 1.302651 |
| 7. internal combustion | 1.108040 |
| 8. positive electrode | 1.100686 |
| 9. carbon dioxide | 1.092638 |
| 10. control unit | 1.069478 |
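The idea behind the bigram tf-idf scoring in the report can be sketched in pure Python. This is a minimal illustration of the general technique, not the tool's actual implementation; the toy abstracts and the `bigram_tfidf` helper are hypothetical:

```python
import math
from collections import Counter

def bigrams(text):
    # Lowercase, split on whitespace, and pair adjacent tokens.
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def bigram_tfidf(abstracts):
    # Sum per-document term frequency weighted by smoothed
    # inverse document frequency across the corpus.
    n_docs = len(abstracts)
    doc_bigrams = [bigrams(a) for a in abstracts]
    df = Counter()
    for bg in doc_bigrams:
        df.update(set(bg))
    scores = Counter()
    for bg in doc_bigrams:
        tf = Counter(bg)
        for term, count in tf.items():
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            scores[term] += (count / len(bg)) * idf
    return scores

abstracts = [
    "a fuel cell with improved heat exchanger",
    "fuel cell stack and control unit",
    "heat exchanger for exhaust gas recovery",
]
top = bigram_tfidf(abstracts).most_common(2)
```

On this toy corpus, 'fuel cell' and 'heat exchanger' appear in two documents each and so accumulate the highest scores, mirroring the shape of the report above.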
### Word cloud

Here is a [wordcloud](https://github.com/datasciencecampus/patent_app_detect/output/wordclouds/wordcloud_tech.png) using the Y02 classification on a random sample of 10,000 patents. The greater the tf-idf score, the larger the font size of the term.

### Force directed graph

This output provides an [interactive graph](https://github.com/datasciencecampus/patent_app_detect/outputs/fdg/index.html) that shows connections between terms that are generally found in the same patent documents. This example was run for the Y02 classification on a random sample of 10,000 patents.

## How to install

The tool has been developed to work on both Windows and macOS. To install:

1. Make sure Python 3.6 is installed and set on your path.
It can be installed from [this location](https://www.python.org/downloads/release/python-360/) by selecting the *relevant installer for your operating system*. When prompted, check the box to set the paths and environment variables for you, and you should be ready to go. Python can also be installed as part of Anaconda [here](https://www.anaconda.com/download/#macos).

To check the default Python version for your system, run the following in a command line/terminal:

```
python --version
```

**_Note_**: If Python 2 is the default Python version but you have installed Python 3.6, your path may be set up to use `python3` instead of `python`.

2. To install the packages and dependencies for the tool, run the following from the root directory (patent_app_detect):
```
pip install -e .
```
This will install all the libraries and run some tests. If the tests pass, the app is ready to run. If any of the tests fail, please email thanasis.anthopoulos@ons.gov.uk or ian.grimstead@ons.gov.uk with a screenshot of the failure and we will get back to you.

## How to use

The program is command line driven, and called in the following manner:

```
python detect.py
```

The above produces a default report output of top-ranked terms, using default parameters. Additional command line arguments provide alternative options, for example a word cloud or force directed graph (fdg) output. The option 'all' produces all three:

```
python detect.py -o='report' (using just `python detect.py` defaults to this option)
python detect.py -o='wordcloud'
python detect.py -o='fdg'
python detect.py -o='all'
```

### Choosing patent source

This selects the set of patents used during analysis. The default source is a pre-created random 1,000 patent dataset from the USPTO, `USPTO-random-1000`. Pre-created datasets of 100, 1,000, 10,000, 100,000, and 500,000 patents are available in the `./data` folder. For example, using:

```
python detect.py -ps=USPTO-random-10000
```

will run the tool on a pre-created random dataset of 10,000 patents.

### Choosing CPC classification

This subsets the chosen patent dataset to a particular Cooperative Patent Classification (CPC) class, for example Y02. The Y02 classification is for "technologies or applications for mitigation or adaptation against climate change". In this case a larger patent dataset is generally required, to allow for the reduction in patent numbers after subsetting. An example script is:

```
python detect.py -cpc=Y02 -ps=USPTO-random-10000
```

The number of subset patents is stated in the console. For example, for `python detect.py -cpc=Y02 -ps=USPTO-random-10000` the number of Y02 patents is 197, so the tf-idf will be run on those 197 patents.
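Subsetting by CPC class amounts to a prefix match on each patent's classification codes. A minimal sketch, in which the `patents` structure and its field names are purely illustrative:

```python
# Each patent carries a list of CPC codes; subsetting by class is a
# simple prefix match. The data layout here is hypothetical.
patents = [
    {"id": "US1", "cpc": ["Y02E 60/50", "H01M 8/04"]},
    {"id": "US2", "cpc": ["G06F 17/30"]},
    {"id": "US3", "cpc": ["Y02T 10/70"]},
]

def subset_by_cpc(patents, cpc_class):
    return [p for p in patents
            if any(code.startswith(cpc_class) for code in p["cpc"])]

y02 = subset_by_cpc(patents, "Y02")
print(f"{len(y02)} of {len(patents)} patents matched")  # 2 of 3 patents matched
```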
### Term n-gram limits

Terms identified may be unigrams, bigrams, or trigrams. The following arguments set the n-gram limits to 2-3 word terms (the default values):

```
python detect.py -mn=2 -mx=3
```

### Time limits

This will restrict the patent cohort to only those from 2000 up to now:

```
python detect.py -yf=2000
```

This will restrict the patent cohort to only those between 2000 and 2016:

```
python detect.py -yf=2000 -yt=2016
```

### Time weighting

This option applies a linear weight that starts from 0.01 and ends at 1 between the time limits:

```
python detect.py -t
```
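The linear ramp described above (0.01 at the first year, 1 at the last) can be sketched as follows; this illustrates the stated behaviour, not necessarily the tool's exact code:

```python
def time_weight(year, year_from, year_to):
    # Linearly interpolate from 0.01 at year_from to 1.0 at year_to.
    if year_to == year_from:
        return 1.0
    fraction = (year - year_from) / (year_to - year_from)
    return 0.01 + fraction * (1.0 - 0.01)
```

So with `-yf=2000 -yt=2016`, a patent from 2000 is weighted 0.01, one from 2016 is weighted 1, and one from 2008 falls halfway between.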
### Citation weighting

This will weight the term tf-idf scores by the number of citations each patent has. The weight is a normalised value between 0 and 1, with higher values indicating higher numbers of citations:

```
python detect.py -c
```
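Normalising citation counts into a 0-1 weight can be sketched as below; a minimal illustration, and the tool's exact normalisation may differ:

```python
def citation_weights(citation_counts):
    # Scale each patent's citation count into [0, 1]; more citations
    # give a weight closer to 1.
    highest = max(citation_counts)
    if highest == 0:
        return [0.0 for _ in citation_counts]
    return [count / highest for count in citation_counts]

weights = citation_weights([0, 5, 10])  # [0.0, 0.5, 1.0]
```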
### Term focus

This option utilises a second random patent dataset, by default `USPTO-random-10000`, whose terms are discounted from the chosen CPC classification to try and 'focus' the identified terms away from terms found more generally in the patent dataset:

```
python detect.py -f
```

### Choose focus source

This selects the set of patents used by the term focus option, for example when a larger focus dataset is needed:

```
python detect.py -fs=USPTO-random-100000
```

### Config files

There are three configuration files available inside the config directory:

- stopwords_glob.txt
- stopwords_n.txt
- stopwords_uni.txt

The first file (stopwords_glob.txt) contains stopwords that are applied to all n-grams. The second file (stopwords_n.txt) contains stopwords that are applied to all n-grams for n>1, and the last file (stopwords_uni.txt) contains stopwords that apply only to unigrams. Users can append stopwords to these files to suppress undesirable output terms.
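Applying the three stopword lists by n-gram length can be sketched as follows. This is an assumed reading of the rules above (per-word matching for the global and n>1 lists, whole-term matching for unigrams); the file contents shown are hypothetical examples:

```python
def keep_term(term, stop_glob, stop_n, stop_uni):
    # stop_glob applies to every term; stop_uni only to unigrams;
    # stop_n only to bigrams and longer.
    words = term.split()
    if any(w in stop_glob for w in words):
        return False
    if len(words) == 1:
        return term not in stop_uni
    return not any(w in stop_n for w in words)

stop_glob = {"the", "of"}      # hypothetical stopwords_glob.txt
stop_n = {"method"}            # hypothetical stopwords_n.txt
stop_uni = {"device"}          # hypothetical stopwords_uni.txt

terms = ["fuel cell", "method step", "device", "heat exchanger"]
kept = [t for t in terms if keep_term(t, stop_glob, stop_n, stop_uni)]
```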
## Help

A help function details the range and usage of these command line arguments:
```
python detect.py -h
```

An edited version of the help output is included below. It starts with a summary of the arguments:

```
python detect.py -h
usage: detect.py [-h] [-f] [-c] [-t] [-p {median,max,sum,avg}]
                 [-o {fdg,wordcloud,report,all}] [-yf YEAR_FROM] [-yt YEAR_TO]
                 [-np NUM_NGRAMS_REPORT] [-nd NUM_NGRAMS_WORDCLOUD]
                 [-nf NUM_NGRAMS_FDG] [-ps PATENT_SOURCE] [-fs FOCUS_SOURCE]
                 [-mn {1,2,3}] [-mx {1,2,3}] [-rn REPORT_NAME]
                 [-wn WORDCLOUD_NAME] [-wt WORDCLOUD_TITLE]
                 [-cpc CPC_CLASSIFICATION]

create report, wordcloud, and fdg graph for patent texts
```

It continues with a detailed description of the arguments:
```
optional arguments:
  -h, --help            show this help message and exit
  -f, --focus           clean output from terms that appear in general
  -c, --cite            weight terms by citations
  -t, --time            weight terms by time
  -p {median,max,sum,avg}, --pick {median,max,sum,avg}
                        options are <median> <max> <sum> <avg> defaults to
                        sum. Average is over non zero values
  -o {fdg,wordcloud,report,all}, --output {fdg,wordcloud,report,all}
                        options are: <fdg> <wordcloud> <report> <all>
  -yf YEAR_FROM, --year_from YEAR_FROM
                        The first year for the patent cohort
  -yt YEAR_TO, --year_to YEAR_TO
                        The last year for the patent cohort (0 is now)
  -np NUM_NGRAMS_REPORT, --num_ngrams_report NUM_NGRAMS_REPORT
                        number of ngrams to return for report
  -nd NUM_NGRAMS_WORDCLOUD, --num_ngrams_wordcloud NUM_NGRAMS_WORDCLOUD
                        number of ngrams to return for wordcloud
  -nf NUM_NGRAMS_FDG, --num_ngrams_fdg NUM_NGRAMS_FDG
                        number of ngrams to return for fdg graph
  -ps PATENT_SOURCE, --patent_source PATENT_SOURCE
                        the patent source to process
  -fs FOCUS_SOURCE, --focus_source FOCUS_SOURCE
                        the patent source for the focus function
  -mn {1,2,3}, --min_n {1,2,3}
                        the minimum ngram value
  -mx {1,2,3}, --max_n {1,2,3}
                        the maximum ngram value
  -rn REPORT_NAME, --report_name REPORT_NAME
                        report filename
  -wn WORDCLOUD_NAME, --wordcloud_name WORDCLOUD_NAME
                        wordcloud filename
  -wt WORDCLOUD_TITLE, --wordcloud_title WORDCLOUD_TITLE
                        wordcloud title
  -cpc CPC_CLASSIFICATION, --cpc_classification CPC_CLASSIFICATION
                        the desired cpc classification
```
