Grab additional contributors from CMS #117

@MarcosSueiro

Not all contributors are listed as "appearances" in the Publisher API (noted by Dylan; thank you!).

A. Articles and episodes seem to list additional contributors and tags in a <PageMap> section of the page's HTML source, which is commented out.
Example: https://www.wnyc.org/story/trump-indicted-again/ (view the page source to see the <pageMap> section)

B. Other stories (such as segments) may not present additional contributors at all; the link can only be seen in internal.wnyc.org (it does not seem to show in CSV exports from the CMS either). However, contributors (guests) are often in bold in the body; they may also appear as tags such as john-doe or john_doe.
Example: http://www.wnycstudios.org/story/109322-exiled-president-baby-doc-returns-to-haiti/ (see also https://internal.wnyc.org/admin/cms/segment/109322/).

Steps in case (A):

  1. Load the HTML page for a given slug as unparsed text
  2. Parse out the <PageMap> section (commented out)
  3. Convert this section to XML
  4. Parse out contributors by selecting <DataObject type="person">
  5. Use the slug given in the child <Attribute name="id"> element
  6. Use the Publisher API to obtain the additional contributor info
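
Steps 2 to 5 could look roughly like the sketch below. It is written in Python with the standard library only, as an illustration rather than the project's actual (XSLT-based) tooling. The function names `extract_pagemap` and `pagemap_ids` are hypothetical, the HTML is assumed to be loaded already (step 1), and the Publisher API call (step 6) is left out because its shape is not described here.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional

# Matches an HTML comment whose whole body is a <PageMap>...</PageMap> element.
PAGEMAP_COMMENT = re.compile(r"<!--\s*(<PageMap>.*?</PageMap>)\s*-->", re.DOTALL)


def extract_pagemap(html: str) -> Optional[ET.Element]:
    """Steps 2-3: find the commented-out <PageMap> and parse it as XML."""
    match = PAGEMAP_COMMENT.search(html)
    if match is None:
        return None  # page has no commented-out PageMap (e.g. case B)
    return ET.fromstring(match.group(1))


def pagemap_ids(pagemap: ET.Element, object_type: str) -> List[str]:
    """Steps 4-5: text of <Attribute name="id"> under DataObjects of one type."""
    ids = []
    for obj in pagemap.findall(f"DataObject[@type='{object_type}']"):
        for attr in obj.findall("Attribute[@name='id']"):
            if attr.text and attr.text.strip():
                ids.append(attr.text.strip())
    return ids


# Shortened sample in the same shape as the page source quoted in this issue.
sample = """<html><body>
<!--
    <PageMap>
      <DataObject type="tag">
        <Attribute name="id">politics</Attribute>
      </DataObject>
      <DataObject type="person">
        <Attribute name="id">david-freedlander</Attribute>
      </DataObject>
    </PageMap>
    -->
</body></html>"""

pagemap = extract_pagemap(sample)
if pagemap is not None:
    print(pagemap_ids(pagemap, "person"))  # ['david-freedlander']
    print(pagemap_ids(pagemap, "tag"))     # ['politics']
```

The same helper returns tags when called with "tag", so tags and contributors can be collected in one pass over the parsed section.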

Potential steps in case (B):

  1. Parse out the bold names.
  2. Parse out Twitter-handle links, e.g. <a href="https://twitter.com/walterolson">Walter Olson</a>
  3. Use something like ParseContributors.xsl to extract potential Firstname Lastname combinations (could be slow).
  4. Check if tags such as "john-doe" or "john_doe" exist as https://www.wnyc.org/people/john-doe (not sure what this might prove)
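
Steps 1, 2 and 4 for case (B) might be sketched as follows, again in Python with the standard library only. `CandidateCollector` and `tag_to_people_url` are hypothetical names, the body HTML is assumed to be available already, and the results are only candidates: bold text is not necessarily a person's name, and the generated /people/ URL still has to be checked for existence.

```python
import re
from html.parser import HTMLParser

# Profile links of the form https://twitter.com/<handle>
TWITTER_LINK = re.compile(r"^https?://(?:www\.)?twitter\.com/([A-Za-z0-9_]+)/?$")


class CandidateCollector(HTMLParser):
    """Collect text inside <b>/<strong> and the text of Twitter-profile links."""

    def __init__(self):
        super().__init__()
        self.bold_names = []     # text found inside bold tags
        self.twitter_links = []  # (handle, link text) pairs
        self._bold_text = None   # list of chunks while inside bold, else None
        self._link = None        # (handle, chunks) while inside a Twitter link

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong") and self._bold_text is None:
            self._bold_text = []
        elif tag == "a":
            match = TWITTER_LINK.match(dict(attrs).get("href") or "")
            if match:
                self._link = (match.group(1), [])

    def handle_data(self, data):
        if self._bold_text is not None:
            self._bold_text.append(data)
        if self._link is not None:
            self._link[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._bold_text is not None:
            text = " ".join("".join(self._bold_text).split())
            if text:
                self.bold_names.append(text)
            self._bold_text = None
        elif tag == "a" and self._link is not None:
            handle, chunks = self._link
            self.twitter_links.append((handle, " ".join("".join(chunks).split())))
            self._link = None


def tag_to_people_url(tag: str) -> str:
    """Turn a tag such as john_doe into a candidate /people/ URL (unverified)."""
    return "https://www.wnyc.org/people/" + tag.strip().lower().replace("_", "-")


body = ('<p><strong>Walter Olson</strong> joins us. Follow '
        '<a href="https://twitter.com/walterolson">Walter Olson</a>.</p>')
collector = CandidateCollector()
collector.feed(body)
print(collector.bold_names)           # ['Walter Olson']
print(collector.twitter_links)        # [('walterolson', 'Walter Olson')]
print(tag_to_people_url("john_doe"))  # https://www.wnyc.org/people/john-doe
```

A name that shows up in both lists, or whose candidate URL resolves, is a stronger candidate than one found by a single heuristic.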

This is in the source code of https://www.wnyc.org/story/trump-indicted-again/:

<!--
    <PageMap>
      <DataObject type="date">
        <Attribute name="display" value="Jun 09, 2023"/>
        <Attribute name="sort" value="20230609"/>
      </DataObject>
        <DataObject type="tag">
          <Attribute name="id">gop</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">maga</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">new_york_republican_party</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">politics</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">trump</Attribute>
        </DataObject>
        <DataObject type="person">
          <Attribute name="id">david-freedlander</Attribute>
        </DataObject>
    </PageMap>
    -->

Labels: enhancement (New feature or request)