Searching for articles

Gregor Leban edited this page Jun 11, 2017 · 41 revisions

In order to search for articles in Event Registry we provide two classes - QueryArticles and QueryArticlesIter. Both classes can be used to find articles using a set of various types of search conditions.

The QueryArticlesIter class returns an iterator that makes it easy to iterate over all articles that match the search conditions. Alternatively, the QueryArticles class can be used to obtain a broader range of information about the matching articles in various forms. With QueryArticles, the results can include not only the list of articles, but also the time distribution of when the articles were published, the distribution of top news sources that wrote the matching articles, the top concepts mentioned in the articles, etc.

The returned information about articles follows the Article data model.

QueryArticlesIter

Example of usage

Before describing the class, here is a simple full example that prints the list of all articles that mention George Clooney:

from eventregistry import *
er = EventRegistry(apiKey = YOUR_API_KEY)
q = QueryArticlesIter(conceptUri = er.getConceptUri("George Clooney"))
for article in q.execQuery(er):
    print(article)

Constructor

QueryArticlesIter is derived from QueryArticles. Its constructor accepts the following arguments:

QueryArticlesIter(keywords = "",
        conceptUri = [],
        sourceUri = [],
        locationUri = [],
        categoryUri = [],
        lang = [],
        dateStart = "",
        dateEnd = "",
        dateMentionStart = "",
        dateMentionEnd = "",
        ignoreKeywords = "",
        ignoreConceptUri = [],
        ignoreLocationUri = [],
        ignoreSourceUri = [],
        ignoreCategoryUri = [],
        ignoreLang = [],
        categoryIncludeSub = True,
        ignoreCategoryIncludeSub = True)

Parameters for which you don't specify a value will be ignored. For the query to be valid (that is, executable by Event Registry), it has to have at least one positive condition (conditions that start with ignore* do not count as positive conditions). The meaning of the arguments is as follows:

  • keywords: find articles that mention all the specified keywords. In case of multiple keywords, separate them with space. Example: "apple iphone".
  • conceptUri: find articles where the concept with concept URI is mentioned. A single concept URI can be provided as a string, multiple concept URIs can be provided as a list of strings. If multiple concept URIs are provided, resulting articles have to mention all of them. To obtain a concept URI using a concept label use EventRegistry.getConceptUri().
  • sourceUri: find articles that were written by a news source sourceUri. If multiple sources are provided, resulting articles have to be written by any of the provided news sources. Source URI for a given news source name can be obtained using EventRegistry.getNewsSourceUri().
  • locationUri: find articles that describe an event that occurred at a particular location. Location URI can either be a city or a country. If multiple locations are provided, resulting articles have to match any of the locations. Location URI for a given name can be obtained using EventRegistry.getLocationUri().
  • categoryUri: find articles that are assigned into a particular category. If multiple categories are provided, resulting articles have to be assigned to any of the categories. A category URI can be obtained from a category name using EventRegistry.getCategoryUri().
  • lang: find articles that are written in the specified language. If more than one language is specified, resulting articles have to be written in any of the languages.
  • dateStart: find articles that were written on or after dateStart. The date should be provided in YYYY-MM-DD format, or as a datetime.date or datetime.datetime instance.
  • dateEnd: find articles that were written on or before dateEnd. The date should be provided in YYYY-MM-DD format, or as a datetime.date or datetime.datetime instance.
  • dateMentionStart: find articles that explicitly mention a date that is equal or greater than dateMentionStart.
  • dateMentionEnd: find articles that explicitly mention a date that is lower or equal to dateMentionEnd.
  • ignoreKeywords: ignore articles that mention all provided keywords.
  • ignoreConceptUri: ignore articles that mention all provided concepts.
  • ignoreLocationUri: ignore articles that describe an event that occurred in any of the provided locations. A location can be a city or a place.
  • ignoreSourceUri: ignore articles that have been written by any of the specified news sources.
  • ignoreCategoryUri: ignore articles that are assigned to any of the provided categories.
  • ignoreLang: ignore articles that are written in any of the provided languages.
  • categoryIncludeSub: when a category is specified using categoryUri, determines whether all of its subcategories are included as well.
  • ignoreCategoryIncludeSub: when a category is specified using ignoreCategoryUri, determines whether all of its subcategories are included as well.
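The rule about positive conditions can be checked locally before a query is sent. The following is a minimal sketch in plain Python; the helper has_positive_condition() is hypothetical and not part of the eventregistry package, and it assumes that every argument whose name does not start with ignore counts as a positive condition.

```python
# Hypothetical helper for illustration only; not part of the eventregistry package.
# Assumption: every constructor argument that does not start with "ignore"
# (and is not one of the *IncludeSub flags) counts as a positive condition.
POSITIVE_KEYS = (
    "keywords", "conceptUri", "sourceUri", "locationUri", "categoryUri",
    "lang", "dateStart", "dateEnd", "dateMentionStart", "dateMentionEnd",
)


def has_positive_condition(params):
    """Return True if at least one positive argument has a non-empty value."""
    return any(params.get(key) for key in POSITIVE_KEYS)


# Only an ignore* condition: not a valid query.
print(has_positive_condition({"ignoreLang": ["deu"]}))  # False
# A keyword condition plus an ignore* condition: valid.
print(has_positive_condition({"keywords": "apple iphone", "ignoreLang": ["deu"]}))  # True
```

Empty strings and empty lists are treated as "not specified", which mirrors the default values in the constructor signature.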

Methods

The class has two main methods: count() and execQuery().

count(er) simply returns the number of articles that match the specified conditions. er is the instance of EventRegistry class.

execQuery method has the following format:

execQuery(er,
        sortBy = "rel",
        sortByAsc = False,
        returnInfo = ReturnInfo(),
        articleBatchSize = 200)

  • er: instance of the EventRegistry class.
  • sortBy: sets the order in which the resulting articles are sorted before they are returned. Options: id (internal id), date (publishing date), cosSim (closeness to the centroid of the associated event), fq (relevance to the query), socialScore (total shares on social media).
  • sortByAsc: whether the results should be sorted in ascending order.
  • returnInfo: sets the properties of the various types of data that are returned (events, concepts, categories, ...). See details.
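To illustrate the general idea behind iterating over results that are retrieved in batches, here is a sketch in plain Python. It is not the library's implementation: iterate_in_batches(), fetch_page() and the fake data source are hypothetical names used only for this example, and no call to Event Registry is made.

```python
def iterate_in_batches(fetch_page, batch_size=200):
    """Yield items one at a time while requesting them page by page.

    fetch_page(page, count) is a hypothetical callable that returns a list
    with up to `count` items for the 1-based page number `page`.
    """
    page = 1
    while True:
        batch = fetch_page(page, batch_size)
        for item in batch:
            yield item
        if len(batch) < batch_size:
            return  # a short (or empty) page means there is nothing left
        page += 1


# Stand-in data source used only for this illustration.
FAKE_ARTICLES = ["article-%d" % i for i in range(450)]


def fake_fetch(page, count):
    start = (page - 1) * count
    return FAKE_ARTICLES[start:start + count]


# Three requests are made (200 + 200 + 50 items), but the caller sees one stream.
print(len(list(iterate_in_batches(fake_fetch))))  # 450
```

The caller never deals with page numbers, which is the convenience that an iterator-style class offers over requesting individual pages.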

QueryArticles

Example of usage

Before describing the QueryArticles() class and the details that can be requested, let's look at an example of its usage:

import datetime
from eventregistry import *
er = EventRegistry(apiKey = YOUR_API_KEY)
q = QueryArticles()
# set the date limit of interest
q.setDateLimit(datetime.date(2014, 4, 16), datetime.date(2014, 4, 28))
# find articles mentioning the company Apple
q.addConcept(er.getConceptUri("Apple"))
# return the list of top 30 articles, including the concepts, categories and article image
q.setRequestedResult(RequestArticlesInfo(page = 1, count = 30,
    returnInfo = ReturnInfo(articleInfo = ArticleInfoFlags(concepts = True, categories = True, image = True))))
res = er.execQuery(q)

The returned information about articles in res follows the Article data model.

Constructor

The QueryArticles constructor accepts the following arguments:

QueryArticles(keywords = "",
        conceptUri = [],
        sourceUri = [],
        locationUri = [],
        categoryUri = [],
        lang = [],
        dateStart = "",
        dateEnd = "",
        dateMentionStart = "",
        dateMentionEnd = "",
        ignoreKeywords = "",
        ignoreConceptUri = [],
        ignoreLocationUri = [],
        ignoreSourceUri = [],
        ignoreCategoryUri = [],
        ignoreLang = [],
        categoryIncludeSub = True,
        ignoreCategoryIncludeSub = True)

Parameters for which you don't specify a value will be ignored. For the query to be valid (that is, executable by Event Registry), it has to have at least one positive condition (conditions that start with ignore* do not count as positive conditions). The meaning of the arguments is as follows:

  • keywords: find articles that mention all the specified keywords. In case of multiple keywords, separate them with space. Example: "apple iphone".
  • conceptUri: find articles where the concept with concept URI is mentioned. A single concept URI can be provided as a string, multiple concept URIs can be provided as a list of strings. If multiple concept URIs are provided, resulting articles have to mention all of them. To obtain a concept URI using a concept label use EventRegistry.getConceptUri().
  • sourceUri: find articles that were written by a news source sourceUri. If multiple sources are provided, resulting articles have to be written by any of the provided news sources. Source URI for a given news source name can be obtained using EventRegistry.getNewsSourceUri().
  • locationUri: find articles that describe an event that occurred at a particular location. Location URI can either be a city or a country. If multiple locations are provided, resulting articles have to match any of the locations. Location URI for a given name can be obtained using EventRegistry.getLocationUri().
  • categoryUri: find articles that are assigned into a particular category. If multiple categories are provided, resulting articles have to be assigned to any of the categories. A category URI can be obtained from a category name using EventRegistry.getCategoryUri().
  • lang: find articles that are written in the specified language. If more than one language is specified, resulting articles have to be written in any of the languages.
  • dateStart: find articles that were written on or after dateStart. The date should be provided in YYYY-MM-DD format, or as a datetime.date or datetime.datetime instance.
  • dateEnd: find articles that were written on or before dateEnd. The date should be provided in YYYY-MM-DD format, or as a datetime.date or datetime.datetime instance.
  • dateMentionStart: find articles that explicitly mention a date that is equal or greater than dateMentionStart.
  • dateMentionEnd: find articles that explicitly mention a date that is lower or equal to dateMentionEnd.
  • ignoreKeywords: ignore articles that mention all provided keywords.
  • ignoreConceptUri: ignore articles that mention all provided concepts.
  • ignoreLocationUri: ignore articles that describe an event that occurred in any of the provided locations. A location can be a city or a place.
  • ignoreSourceUri: ignore articles that have been written by any of the specified news sources.
  • ignoreCategoryUri: ignore articles that are assigned to any of the provided categories.
  • ignoreLang: ignore articles that are written in any of the provided languages.
  • categoryIncludeSub: when a category is specified using categoryUri, determines whether all of its subcategories are included as well.
  • ignoreCategoryIncludeSub: when a category is specified using ignoreCategoryUri, determines whether all of its subcategories are included as well.
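The arguments above combine in two different ways: multiple concept URIs must all be mentioned, while multiple source URIs are alternatives. The sketch below mimics these semantics on a local dictionary. The matches() function, the article layout and the placeholder URIs are assumptions made for illustration; the actual filtering happens on the Event Registry side and the real Article data model is richer.

```python
def as_list(value):
    """Accept a single string or a list of strings, as the constructor does."""
    if isinstance(value, str):
        return [value] if value else []
    return list(value)


def matches(article, conceptUri=(), sourceUri=(), ignoreSourceUri=()):
    """Hypothetical local filter; `article` is a simplified stand-in dict."""
    concepts = set(article["concepts"])
    # conceptUri: every listed concept has to be mentioned
    if not all(uri in concepts for uri in as_list(conceptUri)):
        return False
    # sourceUri: the article may come from any one of the listed sources
    sources = as_list(sourceUri)
    if sources and article["source"] not in sources:
        return False
    # ignoreSourceUri: drop the article if it comes from any listed source
    if article["source"] in as_list(ignoreSourceUri):
        return False
    return True


# Placeholder URIs, not real Event Registry identifiers.
article = {"concepts": ["concept/A", "concept/B"], "source": "source/one"}

print(matches(article, conceptUri=["concept/A", "concept/B"],
              sourceUri=["source/one", "source/two"]))  # True
print(matches(article, conceptUri=["concept/A", "concept/C"]))  # False
print(matches(article, conceptUri="concept/A",
              ignoreSourceUri="source/one"))  # False
```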

Methods

The QueryArticles class provides additional methods that can be used to specify relevant query information. Conditions can also be added using methods such as addConcept(conceptUri), addLocation(locationUri), addCategory(categoryUri), addNewsSource(sourceUri), addKeyword(keyword) and setDateLimit(startDate, endDate).

setArticleIdList(idList) is a special method where you can specify the set of article IDs that you want to use as the result. In this case, all query conditions are ignored and this set is used as the resulting set. All the return information about the articles will be based on this set of articles.

Returned information

When the query is executed, a set of articles will match the specified criteria. You still need to decide what information about these articles should be returned. Do you want to get the details about these articles? Are you interested in the top concepts mentioned in them? Maybe news sources?

The information to be returned about the matching articles is set by calling the setRequestedResult() method, which accepts as an argument an instance of a class derived from RequestArticles. By calling the addRequestedResult() method multiple times on a QueryArticles instance you can retrieve multiple results with a single query. Free users are only allowed one requested result per call and should instead use the setRequestedResult() method. Below are the classes that can be specified in the addRequestedResult() and setRequestedResult() calls:

RequestArticlesInfo

RequestArticlesInfo(page = 1,
        count = 20,
        sortBy = "date", sortByAsc = False,
        returnInfo = ReturnInfo())

RequestArticlesInfo class provides detailed information about the resulting articles.

  • page: determines the page of the results to return (starting from 1)
  • count: determines the number of articles to return. Max articles that can be returned per call is 200.
  • sortBy: sets the order in which the resulting articles are sorted before they are returned. Options: id (internal id), date (publishing date), cosSim (closeness to the event centroid), fq (relevance to the query), socialScore (total shares on social media), facebookShares (Facebook shares), twitterShares (Twitter shares).
  • sortByAsc: should the results be sorted in ascending order
  • returnInfo: sets the properties of various types of data that is returned (concepts, categories, news sources, ...). See details.
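Because a single call returns at most 200 articles, retrieving a larger result set means requesting several pages. The arithmetic can be sketched as follows; pages_needed() and page_for_rank() are hypothetical helpers, not library functions.

```python
import math

MAX_ARTICLES_PER_CALL = 200  # limit stated for the count parameter


def pages_needed(total_articles, count=MAX_ARTICLES_PER_CALL):
    """Number of pages required to retrieve total_articles results."""
    if not 1 <= count <= MAX_ARTICLES_PER_CALL:
        raise ValueError("count must be between 1 and %d" % MAX_ARTICLES_PER_CALL)
    return math.ceil(total_articles / count)


def page_for_rank(rank, count):
    """1-based page that contains the result at 1-based position `rank`."""
    return (rank - 1) // count + 1


print(pages_needed(1050))       # 6
print(page_for_rank(201, 200))  # 2
```

Each page would then be requested with its own RequestArticlesInfo(page = ..., count = ...) instance, as in the usage example earlier in this section.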

RequestArticlesUriList

RequestArticlesUriList returns a simple list of article URIs that match criteria. Useful if you wish to obtain the full list in a single query.

RequestArticlesIdList

RequestArticlesIdList returns a simple list of article IDs that match criteria. Useful if you wish to obtain the full list in a single query.

RequestArticlesTimeAggr

RequestArticlesTimeAggr returns the distribution of the resulting articles over time.

RequestArticlesConceptAggr

RequestArticlesConceptAggr returns a list of the concepts that are mentioned most often in the resulting articles.

RequestArticlesSourceAggr

RequestArticlesSourceAggr provides a list of the news sources that have written the most articles in the results.

RequestArticlesCategoryAggr

RequestArticlesCategoryAggr returns information about the categories that the resulting articles are about.

RequestArticlesKeywordAggr

RequestArticlesKeywordAggr(lang = "eng",
        articlesSampleSize = 500)

RequestArticlesKeywordAggr returns the keywords that best summarize the resulting articles.

  • lang: determines the language for which to compute the keywords. Articles in other languages will be ignored
  • articlesSampleSize: the sample size of articles on which to compute the keywords.

RequestArticlesConceptGraph

RequestArticlesConceptGraph(conceptCount = 25,
        linkCount = 50,
        sampleSize = 500,
        returnInfo = ReturnInfo())

RequestArticlesConceptGraph returns a graph of concepts. Concepts are connected if they frequently occur in the same articles.

  • conceptCount: number of top concepts (nodes) to return
  • linkCount: number of edges in the graph
  • sampleSize: on what sample of articles should the graph be computed
  • returnInfo: the details about the types of return data to include. See details.
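To make the idea of a co-occurrence graph concrete, here is a small sketch that builds one from a local list of concept labels per article. It only illustrates the concept; how Event Registry selects the nodes and weights the edges is not described here, so the function concept_graph() and its ranking by raw counts are assumptions.

```python
from collections import Counter
from itertools import combinations


def concept_graph(article_concepts, concept_count=25, link_count=50):
    """Return (nodes, edges) for the most frequent concepts and concept pairs.

    article_concepts is a list with one list of concept labels per article.
    """
    # Node weight: number of articles that mention the concept.
    node_freq = Counter()
    for concepts in article_concepts:
        node_freq.update(set(concepts))
    top = {concept for concept, _ in node_freq.most_common(concept_count)}

    # Edge weight: number of articles that mention both concepts.
    edge_freq = Counter()
    for concepts in article_concepts:
        kept = sorted(set(concepts) & top)
        edge_freq.update(combinations(kept, 2))
    return sorted(top), edge_freq.most_common(link_count)


# Made-up sample data.
articles = [
    ["Apple", "iPhone", "Tim Cook"],
    ["Apple", "iPhone"],
    ["Apple", "Samsung"],
]
nodes, edges = concept_graph(articles, concept_count=2, link_count=1)
print(nodes)  # ['Apple', 'iPhone']
print(edges)  # [(('Apple', 'iPhone'), 2)]
```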

RequestArticlesConceptMatrix

RequestArticlesConceptMatrix(conceptCount = 25,
        measure = "pmi",
        sampleSize = 500,
        returnInfo = ReturnInfo())

RequestArticlesConceptMatrix computes a matrix of concepts and their dependencies. For individual concept pairs it returns how frequently they co-occur in the resulting articles and how "surprising" this is, based on the frequency of individual concepts.

  • conceptCount: the number of concepts on which to compute the matrix
  • measure: the measure to be used for computing the "surprise factor". Options: pmi (pointwise mutual information), pairTfIdf (pair frequency * IDF of individual concepts), chiSquare.
  • sampleSize: on what sample of articles should the matrix be computed
  • returnInfo: the details about the types of return data to include. See details.
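As a reference for the pmi option, the standard definition of pointwise mutual information compares how often two concepts co-occur with how often they would co-occur if they were independent. The sketch below computes it from raw counts; the logarithm base and any smoothing used by Event Registry are not specified in this document, so base 2 and the function name pmi() are assumptions.

```python
import math


def pmi(pair_count, count_a, count_b, total_articles):
    """Pointwise mutual information: log2( P(a, b) / (P(a) * P(b)) ).

    With P(x) = count / total_articles, the ratio simplifies to
    pair_count * total_articles / (count_a * count_b).
    """
    return math.log2(pair_count * total_articles / (count_a * count_b))


# Two concepts that each appear in 100 of 1000 articles.
# Independent concepts would co-occur in about 10 articles (PMI of 0).
print(pmi(pair_count=10, count_a=100, count_b=100, total_articles=1000))  # 0.0
# Co-occurring in 50 articles is 5 times more than expected: positive PMI.
print(round(pmi(pair_count=50, count_a=100, count_b=100, total_articles=1000), 2))  # 2.32
```

A positive value means the pair appears together more often than chance would suggest, which is the "surprising" case; a negative value means the concepts tend to avoid each other.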

RequestArticlesConceptTrends

RequestArticlesConceptTrends(conceptCount = 10,
        returnInfo = ReturnInfo())

RequestArticlesConceptTrends provides a list of the most popular concepts in the results and their daily trends over time.

  • conceptCount: number of top concepts to return
  • returnInfo: the details about the types of return data to include. See details.

RequestArticlesDateMentionAggr

RequestArticlesDateMentionAggr provides information about the dates that have been found mentioned in the resulting articles.

Advanced query language

For some users, simply providing a list of concepts, keywords, sources, etc. is not sufficient and a more complex way of specifying a query is required. For such purposes we provide a query language in which conditions are specified as a JSON object that resembles the query language used by MongoDB. The grammar for the language is as follows:

ComplexArticleQuery ::=
{
        "$query": CombinedQuery | BaseQuery,

        "isDuplicateFilter": null | "keepAll" | "skipDuplicates" | "keepOnlyDuplicates",
        "hasDuplicateFilter": null | "keepAll" | "skipHasDuplicates" | "keepOnlyHasDuplicates",
        "eventFilter": null | "keepAll" | "skipArticlesWithoutEvent" | "keepOnlyArticlesWithoutEvent"
}

CombinedQuery ::=
{
        "$or": [ CombinedQuery | BaseQuery, ... ],
        "$not": null | CombinedQuery | BaseQuery
}

CombinedQuery ::=
{
        "$and": [ CombinedQuery | BaseQuery, ... ],
        "$not": null | CombinedQuery | BaseQuery
}

BaseQuery ::=
{
        "conceptUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
        "keyword": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
        "categoryUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
        "sourceUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
        "sourceLocationUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
        "locationUri": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
        "lang": null | string | { "$or": [ string, ... ]} | { "$and": [ string, ... ]},
        "startDate": null | string,
        "endDate": null | string,
        "dateMention": null | [string, ... ],

        "minArticlesInEvent": null | int,
        "maxArticlesInEvent": null | int,

        "$not": null | CombinedQuery | BaseQuery
}

Explanation: Each complex article query needs to be a JSON object that has a $query key. The $query key must contain another JSON object that is parsable as a CombinedQuery or a BaseQuery.

A CombinedQuery can be used to specify a list of conditions, where all ($and) or any ($or) of the conditions should hold. The CombinedQuery can also contain a $not key containing another CombinedQuery or BaseQuery that defines the results to be excluded from the results computed by the $and or $or conditions.

The BaseQuery represents a JSON object with the actual conditions to search for. These (positive) conditions can include concepts, keywords, categories, sources, etc. If multiple conditions are specified, for example a conceptUri as well as a sourceUri, then the results have to match all of the conditions. The BaseQuery can also contain a $not key specifying results to exclude from the results matching the positive conditions of the BaseQuery. A BaseQuery containing only the $not key is not a valid query (since it has no positive conditions).
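The validity rule for positive conditions can be checked recursively on a query that is held as a Python dictionary. This is only a sketch of the rule as described here, not the validation performed by Event Registry: the function names are hypothetical, and it assumes that every BaseQuery key other than $not counts as a positive condition and that a $not subquery must itself be valid.

```python
# Keys of BaseQuery taken from the grammar above; treating all of them as
# positive conditions is an assumption of this sketch.
POSITIVE_KEYS = (
    "conceptUri", "keyword", "categoryUri", "sourceUri", "sourceLocationUri",
    "locationUri", "lang", "startDate", "endDate", "dateMention",
    "minArticlesInEvent", "maxArticlesInEvent",
)


def is_valid_query(node):
    """Check the positive-condition rule for a CombinedQuery or BaseQuery dict."""
    if "$and" in node or "$or" in node:
        children = node["$and"] if "$and" in node else node["$or"]
        ok = bool(children) and all(is_valid_query(child) for child in children)
    else:
        ok = any(node.get(key) is not None for key in POSITIVE_KEYS)
    excluded = node.get("$not")
    return ok and (excluded is None or is_valid_query(excluded))


def is_valid_complex_query(complex_query):
    """A ComplexArticleQuery needs a $query key holding a valid query object."""
    query = complex_query.get("$query")
    return isinstance(query, dict) and is_valid_query(query)


# A BaseQuery with only $not has no positive condition.
print(is_valid_complex_query({"$query": {"$not": {"keyword": "x"}}}))  # False
# An $or of two BaseQuery objects, with an exclusion.
print(is_valid_complex_query({
    "$query": {
        "$or": [{"keyword": "a"}, {"conceptUri": "u"}],
        "$not": {"lang": "deu"},
    }
}))  # True
```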

Using this language you can specify queries that are not possible to express using the constructor parameters in QueryArticles or QueryArticlesIter. Here are some examples of queries and what they would return:

A query that would return the list of articles that mention AI or deep learning or machine learning:

{
        "$query": {
                "$or": [
                        { "conceptUri": "http://en.wikipedia.org/wiki/Artificial_Intelligence" },
                        {
                                "keyword": {
                                        "$or": [ "deep learning", "machine learning" ]
                                }
                        }
                ]
        }
}

A query that would return the list of politics related articles about Donald Trump or Hillary Clinton, or business related news that mention Elon Musk:

{
        "$query": {
                "$or": [
                        {
                                "conceptUri": {
                                        "$or": [
                                                "http://en.wikipedia.org/wiki/Donald_Trump",
                                                "http://en.wikipedia.org/wiki/Hillary_Rodham_Clinton"
                                        ]
                                },
                                "categoryUri": "dmoz/Society/Politics"
                        },
                        {
                                "conceptUri": "http://en.wikipedia.org/wiki/Elon_Musk",
                                "categoryUri": "dmoz/Business"
                        }
                ]
        }
}

Depending on your preference, you can build the JSON for these complex queries yourself or you can use the associated classes ComplexArticleQuery(), CombinedQuery() and BaseQuery(). Below is an example where we search for articles that are either about Donald Trump or are in the Politics category, but that were not published in February 2017 and do not mention Barack Obama:

from eventregistry import *
er = EventRegistry()
trumpUri = er.getConceptUri("Trump")
obamaUri = er.getConceptUri("Obama")
politicsUri = er.getCategoryUri("politics")
cq = ComplexArticleQuery(
        query = CombinedQuery.OR(
                [
                        BaseQuery(conceptUri = trumpUri),
                        BaseQuery(categoryUri = politicsUri)
                ],
                exclude = CombinedQuery.OR([
                        BaseQuery(dateStart = "2017-02-01", dateEnd = "2017-02-28"),
                        BaseQuery(conceptUri = obamaUri)])
                )
        )
q = QueryArticles.initWithComplexQuery(cq)
q.setRequestedResult(RequestArticlesInfo())
res = er.execQuery(q)

If you've built the JSON query yourself, you can also use it like this:

er = EventRegistry()
q = QueryArticles.initWithComplexQuery("{ '$query': { ... } }")
q.setRequestedResult(RequestArticlesInfo())
res = er.execQuery(q)

In this case you need to make sure you're providing a valid query in the JSON.
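One way to reduce the risk of a malformed query string is to build the query as a Python dictionary and let the standard json module serialize it. The sketch below uses only the standard library; is_valid_json() is a hypothetical helper, and it checks JSON syntax only, not whether the query follows the grammar above. Whether Event Registry tolerates non-strict JSON (for example single-quoted keys) is not stated here.

```python
import json

# Query built as a plain dictionary, following the grammar above.
query = {
    "$query": {
        "$or": [
            {"conceptUri": "http://en.wikipedia.org/wiki/Elon_Musk"},
            {"keyword": {"$or": ["deep learning", "machine learning"]}},
        ]
    }
}

# json.dumps always produces syntactically valid JSON (double-quoted keys).
query_str = json.dumps(query)


def is_valid_json(text):
    """Return True if `text` parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


print(is_valid_json(query_str))            # True
print(is_valid_json("{ '$query': {} }"))   # False: strict JSON needs double quotes
```

The resulting query_str is the kind of string that would be passed to QueryArticles.initWithComplexQuery() in the example above.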

If you would like to simply iterate through the results that match the query you can of course also use QueryArticlesIter.initWithComplexQuery() instead of QueryArticles.initWithComplexQuery().