Commit 8963870

Chip Nguyen committed: v1.0.0
0 parents  commit 8963870

14 files changed: +1107 -0 lines changed

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
__pycache__
*/__pycache__
build/*
dist/*
lazynlp.egg-info/*

MANIFEST.in

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
include README.md MANIFEST.in
include setup.py
recursive-include . *.txt
include lazynlp/*.txt

README.md

Lines changed: 221 additions & 0 deletions
@@ -0,0 +1,221 @@
# lazynlp

A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this library, you should be able to create datasets larger than the one used by OpenAI for GPT-2.

## Setup
This library uses Python 3.

1. Clone this library and cd into the lazynlp folder:

```
git clone https://github.com/chiphuyen/lazynlp.git
cd lazynlp
```

2. Install dependencies

```
pip3 install -r requirements.txt
```

3. Install the library

```
python3 setup.py install
```

If you want to uninstall the library, use:

```
pip3 uninstall lazynlp
```

## How to create a massive dataset using lazynlp:

### Step 1. Obtain URLs of the webpages you want to crawl
There are several major dumps of URLs available that you can use.

#### Reddit URLs
Here is [the link to all submissions to Reddit by month](https://files.pushshift.io/reddit/submissions/). You can download the raw dumps and process them to get the links. However, keep in mind that each of these dumps is huge (100MB - 1GB).

[@jcpeterson](https://github.com/jcpeterson) is kind enough to provide a list of deduplicated links with at least 3 karma that you can download [here](https://drive.google.com/file/d/1hRtA3zZ0K5UHKOQ0_8d0BIc_1VyxgY51/view?usp=sharing).

There are about 23M URLs from 2015-06 to 2018-10, of which around 40-60% are bad URLs (they no longer exist or aren't scraper-friendly).
This means that after you've downloaded and cleaned all the good URLs, you should have approximately 10M webpages, or 50GB of pure text.

#### Gutenberg
You can download the list of all URLs to US Gutenberg books [here](). There are 50K books, which convert to about 14GB of pure text.

You can also run ``lazynlp.get_us_gutenberg_links()`` to get the same list. For example, if you want to get all the Gutenberg URLs and store them in the file ``us_gutenberg.urls``:

```
lazynlp.get_us_gutenberg_links('us_gutenberg.urls')
```

You can download the list of all URLs to Australian Gutenberg books [here](https://drive.google.com/file/d/1C5aSisXMC3S3OXBFbnETLeK3UTUXEXrC/view?usp=sharing). There are 4k books, which convert to about 1GB of pure text.

You can also run ``lazynlp.get_aus_gutenberg_links()`` to get the same list. For example, if you want to get all the Gutenberg URLs and store them in the file ``aus_gutenberg.urls``:

```
lazynlp.get_aus_gutenberg_links('aus_gutenberg.urls')
```

#### Wikipedia
You can download the Wikipedia dumps [here](https://dumps.wikimedia.org/).

### Step 2. Deduplicate URLs
You don't want to download the same URL multiple times. There are two functions that help you deduplicate all URLs:

```
lazynlp.dedup_lines(files, outfold)
```

This function takes in a list of files (in each file, each line is a URL) and deduplicates each file against all previous files.
All the deduplicated files are saved in ``outfold``.

```
lazynlp.dedup_lines_from_new_file(original_files, new_file, outfile)
```

This function allows you to deduplicate a new file against all previously deduplicated files (``original_files``).

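For example, here is a minimal usage sketch. The filenames are hypothetical, and it assumes ``dedup_lines`` writes the deduplicated copies under the same names into the output folder:

```
import lazynlp

# Hypothetical URL lists collected in Step 1.
url_files = ['reddit.urls', 'us_gutenberg.urls', 'aus_gutenberg.urls']

# Deduplicate each file against all the files before it;
# the deduplicated copies are written into the folder 'deduped/'.
lazynlp.dedup_lines(url_files, 'deduped')

# Deduplicate a newly collected list against the already-deduplicated files
# and write the surviving URLs to 'new.urls.dedup'.
deduped_files = ['deduped/reddit.urls', 'deduped/us_gutenberg.urls', 'deduped/aus_gutenberg.urls']
lazynlp.dedup_lines_from_new_file(deduped_files, 'new.urls', 'new.urls.dedup')
```
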
### Step 3. Download the URLs

```
lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True, extensions=[], domains=[])
```

"""
link_file: a file that contains the links to the webpages to crawl. Each line contains one URL.

folder: the folder that you want to contain your downloaded pages.

timeout: seconds to wait for a page to respond before abandoning it.

default_skip: set to True if you want to automatically skip all URLs with domains and extensions that are known to be scraper-unfriendly.
    You can see the list of excluded domains at lazynlp/exclude_domains.txt.
    You can see the list of excluded extensions at lazynlp/exclude_extensions.txt.
    You can also add your own domains and extensions to skip with the domains and extensions arguments.

In the folder:
    Each URL is downloaded into a file, indexed by the order in which it is downloaded. The first line of each file is the URL. The rest is the textual content of the page.
    index.urls contains all the URLs that have been successfully downloaded.
    bad.urls contains the URLs that are bad.
    connection.urls contains the URLs that haven't been downloaded because of connection issues.
    non_ascii.urls contains the URLs that haven't been downloaded because of bad encoding issues.
    empty.urls contains the URLs that have empty textual content.
"""
If you have a lot of URLs, you can divide the list into multiple files and call this function on each file separately. I was able to run 40 scripts in parallel.
I guess I could have parallelized the code; I just found this to be easier.
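For instance, here is a minimal sketch of such a setup (my own wrapper, not a lazynlp API) that splits one URL file into shards and downloads each shard in its own process, using only ``lazynlp.download_pages()`` as documented above; the file names are hypothetical:

```
import multiprocessing
import lazynlp

def split_url_file(link_file, num_shards):
    """Split one file of URLs into num_shards smaller files; return their names."""
    with open(link_file, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]
    shard_names = []
    for i in range(num_shards):
        name = '{}.shard{}'.format(link_file, i)
        with open(name, 'w') as out:
            out.write('\n'.join(urls[i::num_shards]))
        shard_names.append(name)
    return shard_names

def download_shard(args):
    shard_file, folder = args
    # Each process downloads its own shard into its own folder.
    lazynlp.download_pages(shard_file, folder, timeout=30, default_skip=True)

if __name__ == '__main__':
    shards = split_url_file('new.urls.dedup', 8)   # hypothetical file name
    jobs = [(shard, 'pages_{}'.format(i)) for i, shard in enumerate(shards)]
    with multiprocessing.Pool(8) as pool:
        pool.map(download_shard, jobs)
```
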
If you want to download each webpage separately, call:

```
lazynlp.download_page(link, ctx=None, timeout=None)
```
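For instance, a one-off call might look like the sketch below (this assumes the function returns the page's textual content, or a falsy value when the download fails):

```
import lazynlp

# Hypothetical URL; timeout is in seconds, as with download_pages().
text = lazynlp.download_page('https://example.com/some-article', timeout=30)
if text:
    print(text[:200])
```
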

### Step 4. Clean the webpages
You can get rid of all HTML tags, decode utf-8 into strings, transliterate foreign characters, collapse white space, replace unprintable characters, unescape HTML, etc., using methods available in lazynlp/cleaner.py.

You can also just call

```
lazynlp.clean_page(page)
```

to do most of it.
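As a rough sketch (assuming ``page`` holds the raw content of a webpage you fetched yourself, as a string, and that the file name below is hypothetical):

```
import lazynlp

# Hypothetical file containing the raw HTML of a page you fetched yourself.
with open('raw_page.html', 'r', encoding='utf-8', errors='ignore') as f:
    page = f.read()

# clean_page strips HTML tags, collapses whitespace, unescapes HTML entities, etc.
text = lazynlp.clean_page(page)
print(text[:500])
```
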

Note:
In this library, the function lazynlp.download_pages() does both the crawling and the cleaning, so the webpages you get are pure text, like this:

```
http://www.thecannabist.co/2017/03/02/jeff-sessions-russia-resign-democrats/74687/
Attorney general nominee Sen. Jeff Sessions, R-Ala., testifies on Capitol Hill in Washington on Jan. 10, 2017, in the first day of his confirmation hearing before the Senate Judiciary Committee. Top Democrats now say that because he misled the committee about his visits to Russia, he should resign. (Andrew Harnik, The Associated Press)

House Oversight and Government Reform Committee Chairman Jason Chaffetz, R-Utah, tweeted early Thursday that "AG Sessions should clarify his testimony and recuse himself."

Later, Sen. Rob Portman, R-Ohio, said in a statement, "Jeff Sessions is a former colleague and a friend, but I think it would be best for him and for the country to recuse himself from the DOJ Russia probe."

House Majority Leader Kevin McCarthy, R-Calif., also initially said during an appearance on MSNBC's "Morning Joe" that Sessions should bow out.

Asked whether Sessions should recuse himself in this situation, McCarthy replied "I think the trust of the American people -- you recuse yourself in these situations, yes."

McCarthy was pressed a second time about whether he was calling for Sessions to recuse himself and he confirmed that he believed the situation required a recusal.

"I think it would be easier from that standpoint, yes," McCarthy said.

But McCarthy later said his comment had been misinterpreted, telling Fox News' "Fox and Friends," "I'm not calling on him to recuse himself. I was asked on 'Morning Joe,' if he needs to recuse himself as going forward. As you just heard, Attorney General Sessions said he would recuse himself going forward -- appropriate, and that's all my answer was."

The comments from prominent Republicans follow revelations that Sessions met with the Russian ambassador during election season. Under oath in front of the Senate Judiciary Committee for his confirmation hearing in January, Sessions had said that he had not met with any Russian officials.

Senate Minority Leader Charles Schumer, D-N.Y., joined growing Democratic calls for Sessions to either resign or at least recuse himself from any investigations into Russia's meddling in U.S. elections.

"Attorney General Sessions cannot possibly lead an investigation into Russian interference in our elections or come anywhere near it. With these revelations, he may indeed become the subject of it," Schumer told reporters. "Better for the country if he resigns, but let's get an investigation going."

Because the Department of Justice should be above reproach, for the good of the country, the Attorney General should resign.
```
### Step 5. Remove duplicated webpages
To avoid any piece of text being over-represented, you want to include only pages that don't significantly overlap with other pages.

To estimate how much the target files overlap with certain source files, use this function:

```
lazynlp.estimate_overlap(source_files, target_files, gran='word', n=8, capacity=10000, error_rate=1e-5, header=0, interval=100000)
```

``gran`` is the granularity of the tokens: 'char' or 'word' level.

``n`` is the n-gram size.

``capacity`` and ``error_rate`` are for the BloomFilter used.

``header`` is the number of lines at the beginning of each file to skip, since in our format the first line is the URL.

To estimate how much a target file overlaps with an existing BloomFilter, use this function:

```
lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8, header=0)
```

Given a list of files, e.g. cleaned webpages, to filter out all the files whose overlap with other files exceeds ``threshold``, use this function:

```
lazynlp.filter_files(files, threshold=0.5, gran='word', n=8, capacity=100000000, error_rate=1e-7, header=0, interval=1000000)
```

The names of all the files that are deemed duplicates are stored in ``dupped_files.list``.

The names of all the files used for the dataset are stored in ``clean_files.list``.
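Putting the step together, a minimal sketch might look like the one below. The folder layout and file extensions are hypothetical, and it assumes the two ``.list`` files are written to the working directory; ``header=1`` skips the URL line at the top of each downloaded page:

```
import glob
import lazynlp

# Hypothetical layout: all cleaned pages from Steps 3-4 live in pages_*/ folders.
files = glob.glob('pages_*/*')

# Drop every file whose estimated n-gram overlap with the files kept so far
# exceeds 50%. Kept files are listed in clean_files.list, dropped ones in
# dupped_files.list.
lazynlp.filter_files(files, threshold=0.5, gran='word', n=8, header=1)

# Estimate how much some candidate files overlap with the kept dataset.
with open('clean_files.list') as f:
    kept = [line.strip() for line in f if line.strip()]
candidates = glob.glob('wiki_pages/*')   # hypothetical extra pages
overlaps = lazynlp.estimate_overlap(kept, candidates, gran='word', n=8, header=1)
print(overlaps)
```
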

Some statistics to keep in mind:
1. 1GB of text is about 1 billion characters. An English word has on average 4.5 characters, or 5.5 including whitespace.
So 1GB of text is about 181M words.

2. When I ran 30 scripts in parallel, it took 3 hours to download and clean 1GB of pure text. So it'd take about 5 days to get 50GB of pure text.

3. The OpenAI dataset has 40GB of text, which I estimate to contain about 7-8 billion words.
If you download all the webpages from the good Reddit URLs and the Gutenberg books, you should have a dataset bigger than OpenAI's WebText.

4. OpenAI, in their GPT-2 paper, didn't include Wikipedia articles for fear of overlap. You can choose to include only Wikipedia articles that have less than a certain amount of overlap with the existing dataset, using ``lazynlp.estimate_overlap_bf(bf, target_file, gran='word', n=8)``.

lazynlp/__init__.py

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
__VERSION__ = '0.0.1'

from .analytics import *
from .cleaner import *
from .create import *
from .crawl import *
from .utils import *

lazynlp/analytics.py

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
import os
import random
import time

from pybloom import BloomFilter

from lazynlp.cleaner import *
from lazynlp.utils import *

def build_ngram_from_tokens(tokens, n):
    """ Create a dictionary of n-grams from the list of tokens
    """
    count = {}
    curr = tokens[:n]
    count[' '.join(curr)] = 1
    for token in tokens[n:]:
        curr = curr[1:] + [token]
        string = ' '.join(curr)
        if string not in count:
            count[string] = 0
        count[string] += 1
    return count

def build_ngram(file, outfile=None, bf=None, gran='word', n=10, uncase=True, alphanumeric=True, interval=100000):
    """
    gran: granularity of the token. It can be 'word' or 'char'
    bf: BloomFilter to update with the n-grams seen. Use when the file is too large to store a dictionary count
    alphanumeric: whether to keep only alphanumeric characters and spaces
    outfile: if outfile is specified, build a dictionary of n-grams and write it to outfile
    interval: how often to report progress
    """
    if gran not in set(['word', 'char']):
        raise ValueError("gran has to be 'word' or 'char'")
    count = {}
    f = open(file, 'r')
    i = 1
    line = f.readline()
    start = time.time()
    while line:
        line = line.strip()
        if line:
            if uncase:
                line = line.lower()

            if gran == 'word':
                if alphanumeric:
                    line = remove_non_alphanumeric(line)
            else:
                line = remove_non_alpha(line)
            line = collapse_white_spaces(line)
            tokens = line.split()
            line_count = build_ngram_from_tokens(tokens, n)

            if outfile:
                # accumulate this line's n-gram counts into the running total
                for key, value in line_count.items():
                    count[key] = count.get(key, 0) + value

            if bf is not None:
                for key in line_count:
                    bf.add(key)

            if interval > 0 and i % interval == 0:
                print('Process line: {}. Time: {}'.format(i, time.time() - start))
                start = time.time()

        i += 1
        line = f.readline()

    f.close()

    if outfile:
        outfold = outfile[:outfile.rfind('/')]
        os.makedirs(outfold, exist_ok=True)
        dict_sorted_2_file(count, os.path.join(outfile.format(n)))

    if bf:
        return bf

    return count

def build_word_ngram(file, outfile, n=10, alphanumeric=True, norm=True, interval=100000):
    """ Build word n-grams and store them in outfile
    n-grams are in the format:
    [n-gram][tab][count]

    If alphanumeric, exclude all the words that contain non-alphanumeric characters
    """
    # build_ngram's lowercasing flag is named uncase, so norm is passed through as uncase
    return build_ngram(file, outfile=outfile, n=n, gran='word', uncase=norm, alphanumeric=alphanumeric, interval=interval)

def build_char_ngram(file, outfile, n=10, interval=100000):
    """
    Build character n-grams and store them in outfile
    """
    return build_ngram(file, outfile=outfile, n=n, gran='char', interval=interval)

def estimate_overlap(source_files, target_files, gran='word', n=8, capacity=10000, error_rate=1e-5, header=0, interval=100000):
    """ Estimate the overlap of target_files with source_files using n-grams
    gran: granularity of the token. It can be 'word' or 'char'
    header: number of lines of each file to skip. It's because in our format, the first line is the URL
    """
    if gran not in set(['word', 'char']):
        raise ValueError("gran has to be 'word' or 'char'")
    if isinstance(source_files, str):
        source_files = [source_files]
    if isinstance(target_files, str):
        target_files = [target_files]

    bf = BloomFilter(capacity=capacity, error_rate=error_rate)
    for source_file in source_files:
        bf = build_ngram(file=source_file, bf=bf, gran=gran, n=n, uncase=True, alphanumeric=True, interval=interval)

    results = []
    for file in target_files:
        print(file)
        results.append(estimate_overlap_bf(bf, file, gran=gran, n=n, header=header))
    return results

def estimate_overlap_bf(bf, target_file, gran='word', n=8, header=0):
    """ Estimate the overlap of target_file with an existing BloomFilter
    gran: granularity of the token. It can be 'word' or 'char'
    """
    if gran not in set(['word', 'char']):
        raise ValueError("gran has to be 'word' or 'char'")

    f = open(target_file, 'r')
    for _ in range(header + 1):
        line = f.readline()

    total, seen = 0, 0
    while line:
        line = line.strip().lower()

        if gran == 'word':
            line = remove_non_alphanumeric(line)
        else:
            line = remove_non_alpha(line)
        line = collapse_white_spaces(line)
        tokens = line.split()
        line_count = build_ngram_from_tokens(tokens, n)

        for key in line_count:
            if key in bf:
                seen += 1
            total += 1

        line = f.readline()

    f.close()
    result = seen / total
    print('{} seen out of {}: {}'.format(seen, total, result))
    return result
