Commit 62a1dcd

## [v8.6]() (2019-02-22)
**Added**
- We added sentiment, which can now be used in querying of articles and events. The `QueryArticles`, `QueryArticlesIter`, `QueryEvents`, `QueryEventsIter` constructors now all have additional parameters `minSentiment` and `maxSentiment` that can be used to filter the articles and events. The valid values are between -1 (very negative sentiment) and 1 (very positive sentiment). Value 0 represents neutral sentiment.
- Sentiment was also added as a property in the returned articles and events.

**Updated**
- Analytics: We updated `trainTopicOnTweets()`, `trainTopicClearTopic()` and `trainTopicGetTrainedTopic()` methods in the `Analytics` class.
- `QueryArticles.initWithComplexQuery()` was updated - the parameter `dataType` was removed (since the `dataType` value should be provided in the `$filter` section of the query)
- `TopicPage` now supports setting also the source rank percentile
- `Analytics.annotate()` method now supports passing custom parameters that should be used when annotating the text.
- `Analytics.extractArticleInfo()` now also supports passing headers and cookies to be used when extracting article info from url.
- Changed some defaults in the returned data. When searching articles, we now by default return article image and sentiment.
1 parent e97c318 commit 62a1dcd

23 files changed: +270 -120 lines changed

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -1,5 +1,21 @@
 # Change Log

+## [v8.6]() (2019-02-22)
+
+**Added**
+- We added sentiment, which can now be used in querying of articles and events. The `QueryArticles`, `QueryArticlesIter`, `QueryEvents`, `QueryEventsIter` constructors now all have additional parameters `minSentiment` and `maxSentiment` that can be used to filter the articles and events. The valid values are between -1 (very negative sentiment) and 1 (very positive sentiment). Value 0 represents neutral sentiment.
+- Sentiment was also added as a property in the returned articles and events.
+
+**Updated**
+
+- Analytics: We updated `trainTopicOnTweets()`, `trainTopicClearTopic()` and `trainTopicGetTrainedTopic()` methods in the `Analytics` class.
+- `QueryArticles.initWithComplexQuery()` was updated - the parameter `dataType` was removed (since the `dataType` value should be provided in the `$filter` section of the query)
+- `TopicPage` now supports setting also the source rank percentile
+- `Analytics.annotate()` method now supports passing custom parameters that should be used when annotating the text.
+- `Analytics.extractArticleInfo()` now also supports passing headers and cookies to be used when extracting article info from url.
+- Changed some defaults in the returned data. When searching articles, we now by default return article image and sentiment.
+
+
 ## [v8.5]() (2018-08-29)

 **Added**
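
The sentiment filter described in the v8.6 changelog entry can be sketched in isolation. The helper below is hypothetical (the name `sentiment_filter_params` is not part of the library) and only mirrors the behavior visible in this commit: values are expected in the range [-1, 1], and a filter is emitted only when it differs from its default. It raises `ValueError` where the commit itself uses `assert`.

```python
def sentiment_filter_params(min_sentiment=-1, max_sentiment=1):
    """Return only the sentiment filters that differ from the defaults.

    Hypothetical helper for illustration; it is not part of eventregistry.
    """
    params = {}
    for name, value, default in (("minSentiment", min_sentiment, -1),
                                 ("maxSentiment", max_sentiment, 1)):
        if value != default:
            # Non-default values must stay inside the documented range.
            if not -1 <= value <= 1:
                raise ValueError("%s must be in [-1, 1], got %r" % (name, value))
            params[name] = value
    return params


if __name__ == "__main__":
    # Keep only clearly positive content.
    print(sentiment_filter_params(min_sentiment=0.4))
```

In the client itself, the changelog indicates the equivalent would be passing `minSentiment` and `maxSentiment` to the `QueryArticles` or `QueryEvents` constructors. Per the docstring added in this commit, setting either value also drops articles without a computed sentiment.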

eventregistry/Analytics.py

Lines changed: 46 additions & 20 deletions
@@ -10,6 +10,7 @@
 NOTE: the functionality is currently in BETA. The API calls or the provided outputs may change in the future.
 """

+import json
 from eventregistry.Base import *
 from eventregistry.ReturnInfo import *

@@ -21,14 +22,18 @@ def __init__(self, eventRegistry):
         self._er = eventRegistry


-    def annotate(self, text, lang = None):
+    def annotate(self, text, lang = None, customParams = None):
         """
         identify the list of entities and nonentities mentioned in the text
         @param text: input text to annotate
         @param lang: language of the provided document (can be an ISO2 or ISO3 code). If None is provided, the language will be automatically detected
+        @param customParams: None or a dict with custom parameters to send to the annotation service
         @returns: dict
         """
-        return self._er.jsonRequestAnalytics("/api/v1/annotate", { "lang": lang, "text": text })
+        params = {"lang": lang, "text": text}
+        if customParams:
+            params.update(customParams)
+        return self._er.jsonRequestAnalytics("/api/v1/annotate", params)


     def categorize(self, text, taxonomy = "dmoz"):
@@ -75,17 +80,27 @@ def detectLanguage(self, text):
         return self._er.jsonRequestAnalytics("/api/v1/detectLanguage", { "text": text })


-    def extractArticleInfo(self, url, proxyUrl = None):
+    def extractArticleInfo(self, url, proxyUrl = None, headers = None, cookies = None):
         """
         extract all available information about an article available at url `url`. Returned information will include
         article title, body, authors, links in the articles, ...
         @param url: article url to extract article information from
         @param proxyUrl: proxy that should be used for downloading article information. format: {schema}://{username}:{pass}@{proxy url/ip}
+        @param headers: dict with headers to set in the request (optional)
+        @param cookies: dict with cookies to set in the request (optional)
         @returns: dict
         """
         params = { "url": url }
         if proxyUrl:
             params["proxyUrl"] = proxyUrl
+        if headers:
+            if isinstance(headers, dict):
+                headers = json.dumps(headers)
+            params["headers"] = headers
+        if cookies:
+            if isinstance(cookies, dict):
+                cookies = json.dumps(cookies)
+            params["cookies"] = cookies
         return self._er.jsonRequestAnalytics("/api/v1/extractArticleInfo", params)


@@ -98,24 +113,34 @@ def ner(self, text):
         return self._er.jsonRequestAnalytics("/api/v1/ner", {"text": text})


-    def trainTopicOnTweets(self, twitterQuery, useTweetText = True, maxConcepts = 20, maxCategories = 10, maxTweets = 2000, notifyEmailAddress = None):
+    def trainTopicOnTweets(self, twitterQuery, useTweetText=True, useIdfNormalization=True,
+            normalization="linear", maxTweets=2000, maxUsedLinks=500, ignoreConceptTypes=[],
+            maxConcepts = 20, maxCategories = 10, notifyEmailAddress = None):
         """
         create a new topic and train it using the tweets that match the twitterQuery
         @param twitterQuery: string containing the content to search for. It can be a Twitter user account (using "@" prefix or user's Twitter url),
             a hash tag (using "#" prefix) or a regular keyword.
         @param useTweetText: do you want to analyze the content of the tweets and extract the concepts mentioned in them? If False, only content shared
             in the articles in the user's tweets will be analyzed
+        @param useIdfNormalization: normalize identified concepts by their IDF in the news (punish very common concepts)
+        @param normalization: way to normalize the concept weights ("none", "linear")
+        @param maxTweets: maximum number of tweets to collect (default 2000, max 5000)
+        @param maxUsedLinks: maximum number of article links in the tweets to analyze (default 500, max 2000)
+        @param ignoreConceptTypes: what types of concepts you would like to ignore in the profile. options: person, org, loc, wiki or an array with those
         @param maxConcepts: the number of concepts to save in the final topic
         @param maxCategories: the number of categories to save in the final topic
         @param maxTweets: the maximum number of tweets to collect for the user to analyze
         @param notifyEmailAddress: when finished, should we send a notification email to this address?
         """
         assert maxTweets < 5000, "we can analyze at most 5000 tweets"
-        params = {"twitterQuery": twitterQuery,
-            "useTweetText": useTweetText, "maxConcepts": maxConcepts, "maxCategories": maxCategories,
-            "maxTweets": maxTweets}
+        params = {"twitterQuery": twitterQuery, "useTweetText": useTweetText,
+            "useIdfNormalization": useIdfNormalization, "normalization": normalization,
+            "maxTweets": maxTweets, "maxUsedLinks": maxUsedLinks,
+            "maxConcepts": maxConcepts, "maxCategories": maxCategories }
         if notifyEmailAddress:
             params["notifyEmailAddress"] = notifyEmailAddress
+        if len(ignoreConceptTypes) > 0:
+            params["ignoreConceptTypes"] = ignoreConceptTypes
         return self._er.jsonRequestAnalytics("/api/v1/trainTopicOnTwitter", params)


@@ -127,31 +152,32 @@ def trainTopicCreateTopic(self, name):
         return self._er.jsonRequestAnalytics("/api/v1/trainTopic", { "action": "createTopic", "name": name})


-    def trainTopicAddDocument(self, uri, text):
+    def trainTopicClearTopic(self, uri):
         """
-        add the information extracted from the provided "text" to the topic with uri "uri"
-        @param uri: uri of the topic (obtained by calling trainTopicCreateTopic method)
-        @param text: text to analyze and extract information from
+        if the topic is already existing, clear the definition of the topic. Use this if you want to retrain an existing topic
+        @param uri: uri of the topic (obtained by calling trainTopicCreateTopic method) to clear
         """
-        return self._er.jsonRequestAnalytics("/api/v1/trainTopic", { "action": "addDocument", "uri": uri, "text": text})
+        return self._er.jsonRequestAnalytics("/api/v1/trainTopic", { "action": "clearTopic", "uri": uri })


-    def trainTopicFinishTraining(self, uri, maxConcepts = 20, maxCategories = 10, idfNormalization = True):
+    def trainTopicAddDocument(self, uri, text):
         """
         add the information extracted from the provided "text" to the topic with uri "uri"
         @param uri: uri of the topic (obtained by calling trainTopicCreateTopic method)
-        @param maxConcepts: number of top concepts to save in the topic
-        @param maxCategories: number of top categories to save in the topic
-        @param idfNormalization: should the concepts be normalized by punishing the commonly mentioned concepts
-        @param returns: returns the trained topic: { concepts: [], categories: [] }
+        @param text: text to analyze and extract information from
         """
-        return self._er.jsonRequestAnalytics("/api/v1/trainTopic", {"action": "finishTraining", "uri": uri, "maxConcepts": maxConcepts, "maxCategories": maxCategories, "idfNormalization": idfNormalization})
+        return self._er.jsonRequestAnalytics("/api/v1/trainTopic", { "action": "addDocument", "uri": uri, "text": text})


-    def trainTopicGetTrainedTopic(self, uri):
+    def trainTopicGetTrainedTopic(self, uri, maxConcepts = 20, maxCategories = 10,
+            ignoreConceptTypes=[], idfNormalization = True):
         """
         retrieve topic for the topic for which you have already finished training
         @param uri: uri of the topic (obtained by calling trainTopicCreateTopic method)
+        @param maxConcepts: number of top concepts to retrieve in the topic
+        @param maxCategories: number of top categories to retrieve in the topic
+        @param ignoreConceptTypes: what types of concepts you would like to ignore in the profile. options: person, org, loc, wiki or an array with those
+        @param idfNormalization: should the concepts be normalized by punishing the commonly mentioned concepts
         @param returns: returns the trained topic: { concepts: [], categories: [] }
         """
-        return self._er.jsonRequestAnalytics("/api/v1/trainTopic", { "action": "getTrainedTopic", "uri": uri })
+        return self._er.jsonRequestAnalytics("/api/v1/trainTopic", { "action": "getTrainedTopic", "uri": uri, "maxConcepts": maxConcepts, "maxCategories": maxCategories, "idfNormalization": idfNormalization })
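
The parameter handling that `annotate()` and `extractArticleInfo()` gain in this file can be tried without a network call. The sketch below is a standalone approximation: the function names `build_annotate_params` and `build_extract_params` are hypothetical and only reproduce how the request dictionaries appear to be assembled in the diff above.

```python
import json


def build_annotate_params(text, lang=None, custom_params=None):
    """Assemble the annotate payload (hypothetical helper, not library API)."""
    params = {"lang": lang, "text": text}
    if custom_params:
        # dict.update means custom entries can also override "lang" or "text".
        params.update(custom_params)
    return params


def build_extract_params(url, proxy_url=None, headers=None, cookies=None):
    """Assemble the extractArticleInfo payload (hypothetical helper)."""
    params = {"url": url}
    if proxy_url:
        params["proxyUrl"] = proxy_url
    for name, value in (("headers", headers), ("cookies", cookies)):
        if value:
            # Dicts are serialized to a JSON string; strings pass through as-is.
            params[name] = json.dumps(value) if isinstance(value, dict) else value
    return params


if __name__ == "__main__":
    print(build_extract_params("http://example.com/article",
                               headers={"User-Agent": "demo"}))
```

One consequence worth keeping in mind: because the custom parameters are merged last, a key such as `lang` in `customParams` takes precedence over the `lang` argument.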

eventregistry/EventRegistry.py

Lines changed: 12 additions & 4 deletions
@@ -165,7 +165,10 @@ def getUrl(self, query):
         # don't modify original query params
         allParams = query._getQueryParams()
         # make the url
-        url = self._host + query._getPath() + "?" + urllib.urlencode(allParams, doseq=True)
+        try:
+            url = self._host + query._getPath() + "?" + urllib.urlencode(allParams, doseq=True)
+        except:
+            url = self._host + query._getPath() + "?" + urllib.parse.urlencode(allParams, doseq=True)
         return url


@@ -234,7 +237,7 @@ def jsonRequest(self, methodUrl, paramDict, customLogFName = None, allowUseOfArc
                 with open(customLogFName or self._requestLogFName, "a") as log:
                     if paramDict != None:
                         log.write("# " + json.dumps(paramDict) + "\n")
-                    log.write(methodUrl + "\n")
+                    log.write(methodUrl + "\n\n")
             except Exception as ex:
                 self._lastException = ex

@@ -292,6 +295,7 @@ def jsonRequest(self, methodUrl, paramDict, customLogFName = None, allowUseOfArc
                 # in case of invalid input parameters, don't try to repeat the search
                 if respInfo != None and respInfo.status_code == 530:
                     break
+                print("The request will be automatically repeated in 3 seconds...")
                 time.sleep(3)  # sleep for X seconds on error
         self._lock.release()
         if returnData == None:
@@ -327,9 +331,13 @@ def jsonRequestAnalytics(self, methodUrl, paramDict):
                 break
             except Exception as ex:
                 self._lastException = ex
-                print("Event Registry exception while executing the request:")
+                print("Event Registry Analytics exception while executing the request:")
                 self.printLastException()
-                break
+                # in case of invalid input parameters, don't try to repeat the action
+                if respInfo != None and respInfo.status_code == 530:
+                    print("The request will not be repeated since we received a response code 530")
+                    break
+                print("The request will be automatically repeated in 3 seconds...")
                 time.sleep(3)  # sleep for X seconds on error
         self._lock.release()
         if returnData == None:
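
Two patterns in this file lend themselves to a small standalone sketch: building a URL with `urlencode` on both Python 2 and 3, and retrying a failed request unless the server answered with status 530. The version below is an alternative formulation, not the commit's code. It resolves `urlencode` once at import time instead of wrapping the call in `try`/`except`, and `RequestFailed`, `build_url` and `call_with_retry` are hypothetical names introduced for illustration.

```python
import time

try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2


class RequestFailed(Exception):
    """Hypothetical error type carrying the HTTP status code of a failed call."""
    def __init__(self, status_code):
        super(RequestFailed, self).__init__("request failed with status %d" % status_code)
        self.status_code = status_code


def build_url(host, path, params):
    # doseq=True expands list values into repeated keys: {"a": [1, 2]} -> a=1&a=2
    return host + path + "?" + urlencode(params, doseq=True)


def call_with_retry(send, max_tries=3, delay=3.0, sleep=time.sleep):
    """Call send() until it succeeds, max_tries is used up, or a 530 is seen.

    Re-raises the last RequestFailed if no attempt succeeded.
    """
    last_error = None
    for _ in range(max_tries):
        try:
            return send()
        except RequestFailed as ex:
            last_error = ex
            if ex.status_code == 530:
                # Treated as invalid input in the diff above: retrying cannot help.
                break
            sleep(delay)
    raise last_error
```

Passing `sleep` as a parameter is a design choice made here so the delay can be replaced in tests; the client in the diff calls `time.sleep(3)` directly.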

eventregistry/QueryArticles.py

Lines changed: 20 additions & 10 deletions
@@ -37,6 +37,8 @@ def __init__(self,
                  eventFilter = "keepAll",
                  startSourceRankPercentile = 0,
                  endSourceRankPercentile = 100,
+                 minSentiment = -1,
+                 maxSentiment = 1,
                  dataType = "news",
                  requestedResult = None):
         """
@@ -103,6 +105,10 @@ def __init__(self,
             "keepAll" (no filtering, default)
         @param startSourceRankPercentile: starting percentile of the sources to consider in the results (default: 0). Value should be in range 0-90 and divisible by 10.
         @param endSourceRankPercentile: ending percentile of the sources to consider in the results (default: 100). Value should be in range 10-100 and divisible by 10.
+        @param minSentiment: minimum value of the sentiment, that the returned articles should have. Range [-1, 1]. Note: setting the value will remove all articles that don't have
+            a computed value for the sentiment (all non-English articles)
+        @param maxSentiment: maximum value of the sentiment, that the returned articles should have. Range [-1, 1]. Note: setting the value will remove all articles that don't have
+            a computed value for the sentiment (all non-English articles)
         @param dataType: what data types should we search? "news" (news content, default), "pr" (press releases), or "blog".
             If you want to use multiple data types, put them in an array (e.g. ["news", "pr"])
         @param requestedResult: the information to return as the result of the query. By default return the list of matching articles
@@ -160,6 +166,12 @@ def __init__(self,
             self._setVal("startSourceRankPercentile", startSourceRankPercentile)
         if endSourceRankPercentile != 100:
             self._setVal("endSourceRankPercentile", endSourceRankPercentile)
+        if minSentiment != -1:
+            assert minSentiment >= -1 and minSentiment <= 1
+            self._setVal("minSentiment", minSentiment)
+        if maxSentiment != 1:
+            assert maxSentiment >= -1 and maxSentiment <= 1
+            self._setVal("maxSentiment", maxSentiment)
         # always set the data type
         self._setVal("dataType", dataType)

@@ -244,7 +256,7 @@ def count(self, eventRegistry):
     def execQuery(self, eventRegistry,
                   sortBy = "rel",
                   sortByAsc = False,
-                  returnInfo = ReturnInfo(),
+                  returnInfo = None,
                   maxItems = -1,
                   **kwargs):
         """
@@ -270,15 +282,11 @@ def execQuery(self, eventRegistry,


     @staticmethod
-    def initWithComplexQuery(query, dataType = "news"):
+    def initWithComplexQuery(query):
         """
         @param query: complex query as ComplexArticleQuery instance, string or a python dict
-        @param dataType: what data types should we search? "news" (news content, default), "pr" (press releases), or "blog".
-            If you want to use multiple data types, put them in an array (e.g. ["news", "pr"])
         """
         q = QueryArticlesIter()
-        # set data type
-        q._setVal("dataType", dataType)

         # provided an instance of ComplexArticleQuery
         if isinstance(query, ComplexArticleQuery):
@@ -360,7 +368,7 @@ def __init__(self,
                  page = 1,
                  count = 100,
                  sortBy = "date", sortByAsc = False,
-                 returnInfo = ReturnInfo()):
+                 returnInfo = None):
         """
         return article details for resulting articles
         @param page: page of the articles to return
@@ -376,7 +384,8 @@ def __init__(self,
         self.articlesCount = count
         self.articlesSortBy = sortBy
         self.articlesSortByAsc = sortByAsc
-        self.__dict__.update(returnInfo.getParams("articles"))
+        if returnInfo != None:
+            self.__dict__.update(returnInfo.getParams("articles"))


     def setPage(self, page):
@@ -597,7 +606,7 @@ def __init__(self,
                  updatesUntilTm = None,
                  updatesUntilMinsAgo = None,
                  mandatorySourceLocation = False,
-                 returnInfo = ReturnInfo()):
+                 returnInfo = None):
         """
         get the list of articles that were recently added to the Event Registry and match the selected criteria
         @param maxArticleCount: the maximum number of articles to return in the call (the number can be even higher than 100 but in case more articles
@@ -624,4 +633,5 @@ def __init__(self,
         self.recentActivityArticlesUpdatesUntilMinsAgo = updatesUntilMinsAgo
         self.recentActivityArticlesMaxArticleCount = maxArticleCount
         self.recentActivityArticlesMandatorySourceLocation = mandatorySourceLocation
-        self.__dict__.update(returnInfo.getParams("recentActivityArticles"))
+        if returnInfo != None:
+            self.__dict__.update(returnInfo.getParams("recentActivityArticles"))
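
Several signatures in this file switch from `returnInfo = ReturnInfo()` to `returnInfo = None` with a guard before the value is used. The commit does not state its reasoning, but one general Python property makes the `None` form attractive: a default expression is evaluated once, when the function is defined, so every call shares the same object. The sketch below illustrates that with a hypothetical `Options` class standing in for a `ReturnInfo`-like object; none of these names come from the library.

```python
class Options(object):
    """Hypothetical stand-in for a ReturnInfo-like settings object."""
    def __init__(self):
        self.flags = {}

    def get_params(self, prefix):
        # Prefix every flag name, e.g. "IncludeImage" -> "articlesIncludeImage".
        return dict((prefix + key, value) for key, value in self.flags.items())


def shared_default(options=Options()):
    # The Options() instance is created once, when the function is defined.
    return options


def none_default(options=None):
    # Guarded pattern: extra parameters are added only when an object is passed,
    # so omitting it leaves the request to whatever defaults apply elsewhere.
    params = {"page": 1}
    if options is not None:
        params.update(options.get_params("articles"))
    return params


if __name__ == "__main__":
    print(shared_default() is shared_default())
    print(none_default())
```

With the `None` default, a caller who passes nothing sends no return-info parameters at all, which may be how the changed response defaults mentioned in the changelog take effect; that link is an inference rather than something the diff confirms.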
