Skip to content

feat: rewrite metadata urls #509

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 8 additions & 2 deletions src/steps/extract-metadata.js
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ import { toString } from 'hast-util-to-string';
import { remove } from 'unist-util-remove';
import { visit, EXIT, CONTINUE } from 'unist-util-visit';
import {
getAbsoluteUrl, makeCanonicalHtmlUrl, optimizeImageURL, resolveUrl,
getAbsoluteUrl, makeCanonicalHtmlUrl, optimizeImageURL, resolveUrl, makeAbsoluteURLForMeta,
} from './utils.js';
import { toMetaName } from '../utils/modifiers.js';
import { childNodes } from '../utils/hast-utils.js';
Expand Down Expand Up @@ -293,10 +293,16 @@ export default function extractMetaData(state, req) {
}
}

// remove undefined metadata
for (const name of Object.keys(metadata)) {
if (metadata[name] === undefined) {
// remove undefined metadata
delete metadata[name];
} else if (Array.isArray(metadata[name])) {
// make absolute URLs for arrays
metadata[name] = metadata[name].map((value) => makeAbsoluteURLForMeta(state, value));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reuse rewriteUrls here. That util is also used in the rewrite-urls step and turns URLs like this into path-only URLs. Or do we really need absolute URLs here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do we really need absolute URLs here?

or add a flag makeAbsolute to rewrite-urls()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, it would probably be better to apply the rewriting in render, as the logic in metadata is already a bit complex:

for (const [name, value] of Object.entries(meta.page)) {
const attr = name.includes(':') && !name.startsWith('twitter:') ? 'property' : 'name';
if (Array.isArray(value)) {
for (const v of value) {
appendElement($head, createElement('meta', attr, name, 'content', v));
}
} else {
appendElement($head, createElement('meta', attr, name, 'content', value));
}
}
appendElement($head, createElement('link', 'rel', 'alternate', 'type', 'application/xml+atom', 'href', meta.feed, 'title', `${meta.title} feed`));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta are strings only (not anchor with href), I think it must stay absolute urls - you would need a special treatment to consume the meta if it can be a relative or an absolute url (if external link). In fact, making them relative might break all existing logic...

I initially started with rewriteUrls but several things:

  1. it does not make absolute urls but relative - the extra makeAbsolute flag would be required and this adds extra logic which is currently isolated in "my" makeAbsoluteURLForMeta function.
  2. it does not deal with branches - if an author page a url with custombranch--repo--owner.hlx.page, it is not considered - this could be an improvement of the current method
  3. because rewriteUrls is used in different places, I did not want to interfere and take the risk to change some behaviour

I can re-try with rewriteUrls and see the impact.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, it would probably be better to apply the rewriting in render, as the logic in metadata is already a bit complex:

for (const [name, value] of Object.entries(meta.page)) {
const attr = name.includes(':') && !name.startsWith('twitter:') ? 'property' : 'name';
if (Array.isArray(value)) {
for (const v of value) {
appendElement($head, createElement('meta', attr, name, 'content', v));
}
} else {
appendElement($head, createElement('meta', attr, name, 'content', value));
}
}
appendElement($head, createElement('link', 'rel', 'alternate', 'type', 'application/xml+atom', 'href', meta.feed, 'title', `${meta.title} feed`));

But I assume we do not get the values from the metadata sheet in here... no ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dev branch for testing can be easily achieved in the client side code consuming the metadata: it needs the path only anyway and use the current host. And it is already the case today: the code cannot assume it gets the correct host anyway if the content is managed by an author.
The only problem that this solves is authors copy/pasting urls from any of the 3 hosts in the metadata of their Word / gdoc documents, as we tell them to do. If they paste a hlx.page url, we should just make sure it does not appear in production. All the rest is anyway out of scope and does to change here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I do not think this is a "BREAKING CHANGE", slight change in behaviour, yes, but you have a URL today, you still get a URL after this. Different host, yes but points to same "content" in prod.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra logic which is currently isolated in "my" makeAbsoluteURLForMeta function.

Ok, switching to relative links may be too breaking, but isn't similar behavior alreasy in makeCanonicalHtmlUrl today? I wouldn really try to reuse existing rewriting logic as much as possible.

Copy link
Author

@kptdobe kptdobe Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. makeCanonicalHtmlUrl is doing something else (only appending index.html where needed).
We have a large collection of methods doing similar but different things. That's why it is hard to plug on top without breaking backward compatibility.
Also rewriteUrl should be named makeRelative because this is really what it is doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I do not think this is a "BREAKING CHANGE",

yes it is. projects might expect .hlx.page or hlx.live urls, and might be not be able to deal with the production url.

} else {
// make absolute URLs for strings
metadata[name] = makeAbsoluteURLForMeta(state, metadata[name]);
}
}

Expand Down
34 changes: 34 additions & 0 deletions src/steps/utils.js
Original file line number Diff line number Diff line change
Expand Up @@ -223,3 +223,37 @@ export function toArray(v) {
}
return Array.isArray(v) ? v : [v];
}

/**
* Returns an updated value of the provided metadata value:
* if the metadata is a URL or an array of URLs (comma separated)
* and if the URL(s) contains the preview or live host,
* the URL(s) are replaced by the absolute URL(s) to the production host
* @param {PipelineState} state the request state
* @param {string} meta the metadata value
* @returns {string} meta the update metadata value or the original value
*/
export function makeAbsoluteURLForMeta(state, meta) {
if (meta && meta.startsWith('http')) {
const rewrite = [];
const split = meta.split(/\s*,\s*/).map((v) => v.trim());

const previewSuffix = state.previewHost?.replace(/^.*?--/, '--');
const liveSuffix = state.liveHost?.replace(/^.*?--/, '--');
split.forEach((v) => {
try {
const u = new URL(v);
if ((previewSuffix && u.host.endsWith(previewSuffix))
|| (liveSuffix && u.host.endsWith(liveSuffix))) {
rewrite.push(getAbsoluteUrl(state, `${u.pathname}${u.search}${u.hash}`));
} else {
rewrite.push(v);
}
} catch (e) {
rewrite.push(v);
}
});
Comment on lines +243 to +255
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, we need to be careful not to change the behaviour of the canonical url until we decided how it should be handled.

Copy link
Author

@kptdobe kptdobe Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new code only makes sure no ref--repo--owner.hlx.page|live appears in meta on production, it does not change the canonical computation logic.

return rewrite.join(', ');
}
return meta;
}
31 changes: 31 additions & 0 deletions test/fixtures/content/page-metadata-block-urls.html
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
<head>
<title>ACME CORP</title>
<link rel="canonical" href="https://www.adobe.com/page-metadata-block-urls">
<meta name="description" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed euismod, urna eu tempor congue, nisi erat condimentum nunc, eget tincidunt nisl nunc euismod.">
<meta property="og:title" content="ACME CORP">
<meta property="og:description" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed euismod, urna eu tempor congue, nisi erat condimentum nunc, eget tincidunt nisl nunc euismod.">
<meta property="og:url" content="https://www.adobe.com/page-metadata-block-urls">
<meta property="og:image" content="https://www.adobe.com/default-meta-image.png?width=1200&#x26;format=pjpg&#x26;optimize=medium">
<meta property="og:image:secure_url" content="https://www.adobe.com/default-meta-image.png?width=1200&#x26;format=pjpg&#x26;optimize=medium">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="ACME CORP">
<meta name="twitter:description" content="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed euismod, urna eu tempor congue, nisi erat condimentum nunc, eget tincidunt nisl nunc euismod.">
<meta name="twitter:image" content="https://www.adobe.com/default-meta-image.png?width=1200&#x26;format=pjpg&#x26;optimize=medium">
<meta name="locale" content="en-US">
<meta name="zero-cell" content="0">
<meta name="page-url-branch" content="https://www.adobe.com/page.html">
<meta name="live-url-branch" content="https://www.adobe.com/live.html">
<meta name="page-url-main" content="https://www.adobe.com/page.html">
<meta name="live-url-main" content="https://www.adobe.com/live.html">
<meta name="prod-url" content="https://www.adobe.com/prod-page.html">
<meta name="hlx-page-url" content="https://www.adobe.com/express/">
<meta name="hlx-page-urls-1" content="https://www.adobe.com/express/page1, https://www.adobe.com/express/page2">
<meta name="hlx-page-urls-2" content="https://www.adobe.com/express/page3, https://www.adobe.com/express/page4">
<meta name="hlx-page-urls-3" content="https://www.adobe.com/express/page5, https://www.adobe.com/express/page6">
<meta name="hlx-live-url" content="https://www.adobe.com/express/">
<meta name="branch-hlx-live-url" content="https://www.adobe.com/express/">
<link id="favicon" rel="icon" type="image/svg+xml" href="/icons/spark.svg">
<meta name="viewport" content="width=device-width, initial-scale=1">
<script src="/scripts.js" type="module"></script>
<link rel="stylesheet" href="/styles.css">
</head>
40 changes: 40 additions & 0 deletions test/fixtures/content/page-metadata-block-urls.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Metadata URLs rewrite Test

<table>
<tr>
<td>Metadata</td>
</tr>
<tr>
<td>hlx.page url</td>
<td><a href="https://main--helix-pages--adobe.hlx.page/express/">https://main--helix-pages--adobe.hlx.page/express/</a></td>
</tr>
<tr>
<td>hlx.page urls 1</td>
<td>
<p><a href="https://main--helix-pages--adobe.hlx.page/express/page1">https://main--helix-pages--adobe.hlx.page/express/page1</a></p>
<p><a href="https://main--helix-pages--adobe.hlx.page/express/page2">https://main--helix-pages--adobe.hlx.page/express/page2</a></p></td>
</tr>
<tr>
<td>hlx.page urls 2</td>
<td>
<p><a href="https://main--helix-pages--adobe.hlx.page/express/page3">https://main--helix-pages--adobe.hlx.page/express/page3</a>&nbsp;</p>
<p><a href="https://main--helix-pages--adobe.hlx.page/express/page4">https://main--helix-pages--adobe.hlx.page/express/page4</a>&nbsp;</p></td>
</tr>
<tr>
<td>hlx.page urls 3</td>
<td>
<ul>
<li><a href="https://main--helix-pages--adobe.hlx.page/express/page5">https://main--helix-pages--adobe.hlx.page/express/page5</a></li>
<li><a href="https://main--helix-pages--adobe.hlx.page/express/page6">https://main--helix-pages--adobe.hlx.page/express/page6</a></li>
</ul>
</td>
</tr>
<tr>
<td>hlx.live url</td>
<td><a href="https://main--helix-pages--adobe.hlx.live/express/">https://main--helix-pages--adobe.hlx.page/express/</a></td>
</tr>
<tr>
<td>branch hlx.live url</td>
<td><a href="https://another-ref--helix-pages--adobe.hlx.live/express/">https://another-ref--helix-pages--adobe.hlx.page/express/</a></td>
</tr>
</table>
29 changes: 29 additions & 0 deletions test/rendering.test.js
Original file line number Diff line number Diff line change
Expand Up @@ -398,6 +398,35 @@ describe('Rendering', () => {
config = DEFAULT_CONFIG_EMPTY;
await testRender('page-metadata-twitter-fallback', 'head');
});

it('renders meta with URLs - rewrites urls', async () => {
config = structuredClone(DEFAULT_CONFIG);
config.cdn.preview = {
host: '$ref--$repo--$owner.hlx.page',
};
config.cdn.live = {
host: '$ref--$site--$org.hlx.live',
};

config.metadata.live.data['/**'] = config.metadata.live.data['/**'].concat([{
key: 'page-url-branch',
value: 'https://super-test--helix-pages--adobe.hlx.page/page.html',
}, {
key: 'live-url-branch',
value: 'https://super-test--helix-pages--adobe.hlx.live/live.html',
}, {
key: 'page-url-main',
value: 'https://main--helix-pages--adobe.hlx.page/page.html',
}, {
key: 'live-url-main',
value: 'https://main--helix-pages--adobe.hlx.live/live.html',
}, {
key: 'prod-url',
value: 'https://www.adobe.com/prod-page.html',
}]);

await testRender('page-metadata-block-urls', 'head');
});
});

describe('Miscellaneous', () => {
Expand Down
59 changes: 58 additions & 1 deletion test/steps/utils.test.js
Original file line number Diff line number Diff line change
Expand Up @@ -14,11 +14,13 @@ import assert from 'assert';

import {
getAbsoluteUrl,
getOriginalHost, makeCanonicalHtmlUrl,
getOriginalHost,
makeCanonicalHtmlUrl,
optimizeImageURL,
rewriteUrl,
toBlockCSSClassNames,
toArray,
makeAbsoluteURLForMeta,
} from '../../src/steps/utils.js';

describe('Optimize Image URLs', () => {
Expand Down Expand Up @@ -199,3 +201,58 @@ describe('to array test', () => {
assert.deepStrictEqual(toArray(null), []);
});
});

describe('Make absolute URLs for meta test', () => {
const SAMPLE_STATE = {
prodHost: 'www.my.com',
previewHost: 'main--repo--owner.my.page',
liveHost: 'main--repo--owner.my.live',
};

it('returns input for falsy', () => {
const test = (state) => {
assert.strictEqual(makeAbsoluteURLForMeta(state, null), null);
assert.strictEqual(makeAbsoluteURLForMeta(state, ''), '');
assert.strictEqual(makeAbsoluteURLForMeta(state, undefined), undefined);
};
test({});
test(SAMPLE_STATE);
});

it('keeps non-url meta', () => {
const test = (state) => {
assert.strictEqual(makeAbsoluteURLForMeta(state, 'This can be a title'), 'This can be a title');
assert.strictEqual(makeAbsoluteURLForMeta(state, 'Tag 1, Tag 2'), 'Tag 1, Tag 2');
};
test({});
test(SAMPLE_STATE);
});

it('keeps "external" urls', () => {
const test = (state) => {
assert.strictEqual(makeAbsoluteURLForMeta(state, 'https://www.hlx.page/docs'), 'https://www.hlx.page/docs');
assert.strictEqual(makeAbsoluteURLForMeta(state, 'https://www.aem.live/docs'), 'https://www.aem.live/docs');
assert.strictEqual(makeAbsoluteURLForMeta(state, 'https://admin.hlx.live/api'), 'https://admin.hlx.live/api');
assert.strictEqual(makeAbsoluteURLForMeta(state, 'https://www.my.com'), 'https://www.my.com');
assert.strictEqual(makeAbsoluteURLForMeta(state, 'https://www.sample.com/a/b/c/page.html?p=v#anchor'), 'https://www.sample.com/a/b/c/page.html?p=v#anchor');
assert.strictEqual(makeAbsoluteURLForMeta(state, 'https://main--another-repo--owner.my.page/a/b/c/page.html'), 'https://main--another-repo--owner.my.page/a/b/c/page.html');
assert.strictEqual(makeAbsoluteURLForMeta(state, 'https://main--another-repo--owner.my.live/a/b/c/page.html'), 'https://main--another-repo--owner.my.live/a/b/c/page.html');
};
test({});
test(SAMPLE_STATE);
});

it('deals with invalid urls', () => {
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'httpwww.sample.com/a/b/c/page.html'), 'httpwww.sample.com/a/b/c/page.html');
});

it('makes absolute urls', () => {
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'https://main--repo--owner.my.page/a/b/c/page.html'), 'https://www.my.com/a/b/c/page.html');
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'https://main--repo--owner.my.live/a/b/c/page.html'), 'https://www.my.com/a/b/c/page.html');
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'https://anotherbranch--repo--owner.my.page/a/b/c/page.html'), 'https://www.my.com/a/b/c/page.html');
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'https://main--repo--owner.my.page/a/b/c/page.html?p1=v1&p2=v2'), 'https://www.my.com/a/b/c/page.html?p1=v1&p2=v2');
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'https://main--repo--owner.my.page/a/b/c/page.html#anchor'), 'https://www.my.com/a/b/c/page.html#anchor');
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'https://main--repo--owner.my.page/a/b/c/page.html?p1=v1&p2=v2#anchor'), 'https://www.my.com/a/b/c/page.html?p1=v1&p2=v2#anchor');
assert.strictEqual(makeAbsoluteURLForMeta(SAMPLE_STATE, 'https://main--repo--owner.my.page/a/b/c/page.html, https://main--repo--owner.my.live/d/e/f/page.html, combo, https://www.sample.com, https://main--repo--owner.my.page/'), 'https://www.my.com/a/b/c/page.html, https://www.my.com/d/e/f/page.html, combo, https://www.sample.com, https://www.my.com/');
});
});