Hello,
First of all, commendable job. Thank you for your work.
I'm working on a Jupyter notebook, which will be a tutorial on how to use Riko to access unstructured website data in a structured manner. When I finish it, I will send you a pull request with the notebook (or get it to you in an alternative way), as I think it could be a great beginner's guide for everyone who'd like to use Riko.
As I was preparing the notebook, I ran into an interesting situation: when I parse <li> elements using xpathfetchpage, and those elements have other elements nested underneath them, the keys for those nested elements have a weird {http://www.w3.org/1999/xhtml} prefix. The following code snippet illustrates it:
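# Assumed import (not in the original snippet); newer riko releases use `from riko.collections import SyncPipe`
from riko.collections.sync import SyncPipe

url = 'http://www.sozcu.com.tr/kategori/yazarlar/yilmaz-ozdil/'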
xpath = '/html/body/div[5]/div[6]/div[3]/div[1]/div[2]/div[1]/div[1]/div[2]/ul/li/a'
xpath_conf = {'xpath': xpath, 'url': url}
flow_main = SyncPipe('xpathfetchpage', conf=xpath_conf)
print next(flow_main.output)
This prints:
{
u'href': u'http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/',
u'{http://www.w3.org/1999/xhtml}p': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n',
u'{http://www.w3.org/1999/xhtml}span': {
u'content': u'16 Ekim 2016',
u'class': u'date'
},
u'title': u'GATA nedir diye merak ediyorsan\u0131z bu foto\u011frafa iyi bak\u0131n'
}
for the fetched structure:
<a href="http://www.sozcu.com.tr/2016/yazarlar/yilmaz-ozdil/gata-nedir-diye-merak-ediyorsaniz-bu-fotografa-iyi-bakin-1450145/" title="GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın">
<p>GATA nedir diye merak ediyorsanız bu fotoğrafa iyi bakın</p>
<span class="date">16 Ekim 2016</span>
</a>
(This page is updated daily, so the exact output might differ when you run it, but the structure remains the same.)
I was unable to figure out why there's that '{http://www.w3.org/1999/xhtml}' prefix on the nested keys or how to get rid of it. I understand that it differentiates between the attributes of a tag and its nested elements, but maybe there is a flag (that I was unable to find) to retrieve the nested elements as a list under a key like 'child' in the top-level dictionary.
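As a stopgap, I can strip the prefix after the fact. Below is a minimal sketch, assuming the items are plain nested dicts and lists and that the prefix always appears as '{uri}' at the start of a key. Note that strip_namespaces is my own hypothetical helper, not something riko provides, and the sample item is a shortened stand-in for the output above.

```python
import re

# Matches a leading "{namespace-uri}" at the start of a key.
_NS_PREFIX = re.compile(r'^\{[^}]*\}')


def strip_namespaces(item):
    """Recursively drop a leading '{uri}' from every dict key.

    Hypothetical helper for illustration; not part of riko.
    """
    if isinstance(item, dict):
        return {
            _NS_PREFIX.sub('', key): strip_namespaces(value)
            for key, value in item.items()
        }
    if isinstance(item, list):
        return [strip_namespaces(value) for value in item]
    return item


# Shortened stand-in for one item returned by the pipe.
sample = {
    'href': 'http://example.com/article',
    '{http://www.w3.org/1999/xhtml}p': 'Headline',
    '{http://www.w3.org/1999/xhtml}span': {
        'content': '16 Ekim 2016',
        'class': 'date',
    },
}

cleaned = strip_namespaces(sample)
```

One caveat with this approach: if a tag has an attribute with the same name as one of its child elements, the two keys would collide once the prefix is gone, so a built-in option would still be preferable.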
Thank you for your assistance.