Separate SML extraction and scanner from main code #4

Merged
merged 4 commits into from
Oct 17, 2024
30 changes: 16 additions & 14 deletions README.md
@@ -24,29 +24,28 @@ The goal of this library is to support and showcase multiple possible approaches
To create structured email messages, simply use the generator to create a MIME message with structured data included in the HTML body via `<script>` tag:

```java
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.mime.StructuredMimeMessageWrapper;
import com.audriga.jakarta.sml.mime.InlineHtmlMessageBuilder;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import com.audriga.jakarta.sml.extension.mime.StructuredMimeMessageWrapper;
import com.audriga.jakarta.sml.extension.mime.InlineHtmlMessageBuilder;
import jakarta.mail.MessagingException;

import java.util.ArrayList;
import java.util.Collections;

public class Example {

public static void main(String[] args) throws MessagingException {

// Comment email content elements
String emailSubject = "My first structured email";
String textEmailBody = "This is a test email";
String htmlEmailBody = "<html><body>This is a <b>test email</b></body></html>";

// Structured data
String jsonLd = "{\r\n \"@context\": \"http://schema.org\",\r\n \"@type\": \"EventReservation\",\r\n \"reservationId\": \"MBE12345\",\r\n \"underName\": {\r\n \"@type\": \"Person\",\r\n \"name\": \"Noah Baumbach\"\r\n },\r\n \"reservationFor\": {\r\n \"@type\": \"Event\",\r\n \"name\": \"Make Better Email 2024\",\r\n \"startDate\": \"2024-10-15\",\r\n \"organizer\": {\r\n \"@type\": \"Organization\",\r\n \"name\": \"Fastmail Pty Ltd.\",\r\n \"logo\": \"https://www.fastmail.com/assets/images/FM-Logo-RGB-IiFj8alCx1-3073.webp\"\r\n },\r\n \"location\": {\r\n \"@type\": \"Place\",\r\n \"name\": \"Isode Ltd\",\r\n \"address\": {\r\n \"@type\": \"PostalAddress\",\r\n \"streetAddress\": \"14 Castle Mews\",\r\n \"addressLocality\": \"Hampton\",\r\n \"addressRegion\": \"Greater London\",\r\n \"postalCode\": \"TW12 2NP\",\r\n \"addressCountry\": \"UK\"\r\n }\r\n }\r\n }\r\n}";

List<StructuredData> structuredDataList = new ArrayList<>();
structuredDataList.add(new StructuredData(jsonLd));

StructuredMimeMessageWrapper message = new InlineHtmlMessageBuilder()
.subject(emailSubject)
.textBody(textEmailBody)
@@ -64,19 +63,17 @@ public class Example {
To parse structured email messages, you can use the provided classes and methods to extract structured data from the email content.

```java
import com.audriga.jakarta.sml.mime.StructuredMimeMessageWrapper;
import com.audriga.jakarta.sml.extension.mime.StructuredMimeMessageWrapper;
import com.audriga.jakarta.sml.parser.StructuredEmailParser;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.mail.internet.MimeMessage;

import java.util.List;

public class Example {

public static void main(String[] args) throws Exception {

MimeMessage message = ... // obtain a MimeMessage instance

StructuredMimeMessageWrapper structuredMessage = new StructuredMimeParser().parseMessage(message);

for (StructuredData data : structuredMessage.getStructuredData()) {
@@ -88,7 +85,7 @@ public class Example {

### Further examples

For more complete examples, see the [MailProcessingTest](test/com/audriga/jakarta/sml/test/MailProcessingTest.java) class, which demonstrates parsing and creating mails.
For more complete examples, see the [MailProcessingTest](test/com/audriga/jakarta/sml/extension/MailProcessingTest.java) class, which demonstrates parsing and creating mails.

## Building

@@ -115,6 +112,11 @@ This project contains an IMAP account scanner command line tool, which can be us

The folder `test/resources/eml` contains several example files generated with this library. Refer to the [example files documentation](docs/example-files.md) for more information.

### H2LJ

Part of the source code of this project can be built as a separate JAR. This can be used to extract structured data
from HTML input. It is called Html2JSONLD (Java). See the [H2LJ documentation](docs/h2lj.md) for details.

## Contributing

Contributions are welcome! Please open new issues or pull requests on GitHub.
16 changes: 15 additions & 1 deletion build.xml
@@ -27,7 +27,7 @@
</copy>
</target>

<target name="jar" depends="resolve,compile" description="Create a jar file">
<target name="jar-scanner" depends="resolve,compile" description="Create a jar file">
<mkdir dir="${dist-dir}"/>
<jar jarfile="${dist-dir}/__temp.jar">
<zipgroupfileset dir="${lib-source}">
@@ -42,6 +42,20 @@
</jar>
</target>

<target name="jar-h2lj" depends="resolve,compile" description="Create a jar file">
<mkdir dir="${dist-dir}"/>
<jar jarfile="${dist-dir}/h2lj.jar" basedir="${bin}">
<fileset dir="${bin}" includes="src/com/audriga/jakarta/sml/h2lj/**/*.class" />
</jar>
</target>

<target name="jar" depends="resolve,compile" description="Create a jar file">
<mkdir dir="${dist-dir}"/>
<jar jarfile="${dist-dir}/jakarta-structured-email.jar" basedir="${bin}">
<fileset dir="${bin}" includes="src/com/audriga/jakarta/sml/h2lj/**/*.class" />
<fileset dir="${bin}" includes="src/com/audriga/jakarta/sml/extension/**/*.class" />
</jar>
</target>

<target name="clean" description="Cleans this project">
<delete dir="${bin}" failonerror="false" />
12 changes: 12 additions & 0 deletions docs/h2lj.md
@@ -0,0 +1,12 @@
# H2LJ

## Overview

H2LJ is a Java-based library designed to extract structured data from HTML.

## Building the Project

To compile the source code and create the JAR files, run:
```sh
ant jar-h2lj
```
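To make concrete what "extract structured data from HTML" can involve, here is a minimal, dependency-free sketch that pulls JSON-LD blocks out of `<script type="application/ld+json">` elements. It is not the H2LJ API: the class name `JsonLdScriptSketch` and the method `extractJsonLd` are hypothetical, and the regular expression is a simplification. A real extractor would more likely use an HTML parser and would also need to cover Microdata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not the H2LJ API.
final class JsonLdScriptSketch {

    // Matches <script ... type="application/ld+json" ...> BODY </script>.
    // CASE_INSENSITIVE covers <SCRIPT>, DOTALL lets the body span several lines.
    private static final Pattern JSON_LD_SCRIPT = Pattern.compile(
            "<script\\b[^>]*\\btype\\s*=\\s*[\"']application/ld\\+json[\"'][^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed body of every JSON-LD script element, in document order. */
    static List<String> extractJsonLd(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = JSON_LD_SCRIPT.matcher(html);
        while (matcher.find()) {
            blocks.add(matcher.group(1).trim());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<script type=\"application/ld+json\">"
                + "{\"@context\": \"https://schema.org\", \"@type\": \"Event\"}"
                + "</script>"
                + "</head><body>Hello</body></html>";

        // Print each extracted JSON-LD block on its own line.
        for (String block : extractJsonLd(html)) {
            System.out.println(block);
        }
    }
}
```

The sketch returns raw JSON strings and does no JSON validation; wrapping each result in a model class such as `StructuredData` would be the next step in a real pipeline.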
10 changes: 7 additions & 3 deletions docs/scanner.md
@@ -4,11 +4,15 @@ This is a command line tool for scanning an IMAP account.

Its goal is to **find existing messages which contain Schema.org markup** (JSON-LD or Microdata) and to optionally dump the findings as JSON-LD.

**Please consider donating test data** (in anonymized/pseudonomized form) to the [schema-org-examples dataset](https://github.com/audriga/schema-org-examples/).
**Please consider donating test data** (in anonymized/pseudonymous form) to the [schema-org-examples dataset](https://github.com/audriga/schema-org-examples/).

## Building

See build instructions in the [main README](../README.md#Building)
To build the project, use the following command to create the `sml-account-scan.jar` file under `dist/`:

```shell
ant jar-scanner
```

## Running

@@ -37,7 +41,7 @@ See also additional config options below.

### Using with FastMail/Gmail/Microsoft accounts

Some email providers, such as FastMail, Google, and Microsoft recommend OAuth as the default authentication mechnanism. Since this scanner currently does not support OAuth, you can alternatively set up a so-called "app-specific passwords" for those providers.
Some email providers, such as FastMail, Google, and Microsoft recommend OAuth as the default authentication mechanism. Since this scanner currently does not support OAuth, you can alternatively set up a so-called "app-specific passwords" for those providers.

Please see the corresponding provider documentation for details:

@@ -1,8 +1,8 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.model.StructuredSyntax;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import com.audriga.jakarta.sml.h2lj.model.StructuredSyntax;
import jakarta.activation.DataHandler;
import jakarta.mail.Address;
import jakarta.mail.MessagingException;
@@ -1,6 +1,6 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import jakarta.mail.Message;
import jakarta.mail.MessagingException;
import jakarta.mail.Multipart;
@@ -1,7 +1,7 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.mail.Message;
import jakarta.mail.MessagingException;

@@ -1,12 +1,9 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.mail.Message;
import jakarta.mail.MessagingException;
import jakarta.mail.Session;

import java.util.Properties;

public class InlineHtmlMessageBuilder extends AbstractMessageBuilder<InlineHtmlMessageBuilder> {

@@ -1,7 +1,7 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.activation.DataHandler;
import jakarta.mail.MessagingException;
import jakarta.mail.internet.MimeBodyPart;
@@ -1,7 +1,7 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.mail.Message;
import jakarta.mail.MessagingException;

@@ -1,7 +1,7 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.mail.Message;
import jakarta.mail.MessagingException;
import jakarta.mail.internet.MimeMultipart;
@@ -1,6 +1,6 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.activation.ActivationDataFlavor;
import jakarta.activation.DataSource;
import org.eclipse.angus.mail.handlers.text_plain;
@@ -1,6 +1,6 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.activation.DataContentHandler;
import jakarta.activation.DataContentHandlerFactory;

@@ -1,7 +1,7 @@
package com.audriga.jakarta.sml.mime;
package com.audriga.jakarta.sml.extension.mime;

import com.audriga.jakarta.sml.model.MimeTextContent;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.mail.*;
import jakarta.mail.internet.InternetAddress;
import jakarta.mail.internet.MimeMessage;
@@ -1,4 +1,4 @@
package com.audriga.jakarta.sml.model;
package com.audriga.jakarta.sml.extension.model;

public class MimeTextContent {
private String text;
@@ -1,6 +1,8 @@
package com.audriga.jakarta.sml.parser;
package com.audriga.jakarta.sml.extension.parser;

import com.audriga.jakarta.sml.mime.StructuredMimeMessageWrapper;
import com.audriga.jakarta.sml.extension.sender.StructuredMimeParseUtils;
import com.audriga.jakarta.sml.h2lj.parser.StructuredDataExtractionUtils;
import com.audriga.jakarta.sml.extension.mime.StructuredMimeMessageWrapper;
import jakarta.mail.Session;
import jakarta.mail.internet.MimeMessage;

@@ -1,7 +1,7 @@
package com.audriga.jakarta.sml.sender;
package com.audriga.jakarta.sml.extension.sender;

import com.audriga.jakarta.sml.mime.*;
import com.audriga.jakarta.sml.model.StructuredData;
import com.audriga.jakarta.sml.extension.mime.*;
import com.audriga.jakarta.sml.h2lj.model.StructuredData;
import jakarta.mail.*;

import java.io.IOException;
@@ -0,0 +1,65 @@
package com.audriga.jakarta.sml.extension.sender;

import com.audriga.jakarta.sml.extension.mime.StructuredMimeMessageWrapper;
import com.audriga.jakarta.sml.extension.model.MimeTextContent;
import jakarta.mail.BodyPart;
import jakarta.mail.MessagingException;
import jakarta.mail.internet.ContentType;
import jakarta.mail.internet.MimeMessage;
import jakarta.mail.internet.MimeMultipart;

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StructuredMimeParseUtils {
protected static final String TEXT = "text";
protected static final String TEXT_PLAIN = "text/plain";
protected static final String TEXT_ASCII = "text/ascii";
protected static final String TEXT_HTML = "text/html";

public static StructuredMimeMessageWrapper parseMessage(MimeMessage message) throws MessagingException, IOException {
StructuredMimeMessageWrapper smw = new StructuredMimeMessageWrapper(message);
MimeTextContent htmlContent = parseBody(message, Collections.singletonList(TEXT_HTML));
smw.setHtmlBody(htmlContent);
smw.setTextBody(parseBody(message, Arrays.asList(TEXT, TEXT_PLAIN, TEXT_ASCII)));

return smw;
}

public static MimeTextContent parseBody(MimeMessage message, List<String> mimeTypes) throws MessagingException, IOException {
for (String mimeType : mimeTypes) {
if (message.isMimeType(mimeType)) {
return new MimeTextContent((String) message.getContent(), message.getEncoding());
}
}
if (message.isMimeType("multipart/*")) {
MimeMultipart mimeMultipart = (MimeMultipart) message.getContent();
return getBodyFromMultipart(mimeMultipart, mimeTypes);
}
return null;
}

private static MimeTextContent getBodyFromMultipart(MimeMultipart mimeMultipart, List<String> mimeTypes) throws MessagingException, IOException {
for (int i = 0; i < mimeMultipart.getCount(); i++) {
BodyPart bodyPart = mimeMultipart.getBodyPart(i);
for (String mimeType : mimeTypes) {
if (bodyPart.isMimeType(mimeType)) {
String contentType = bodyPart.getContentType().toLowerCase();
ContentType contentTypeObject = new ContentType(contentType);
String charset = contentTypeObject.getParameter("charset");
return new MimeTextContent((String) bodyPart.getContent(), charset);
}
}
if (bodyPart.isMimeType("multipart/*")) {
MimeMultipart nestedMultipart = (MimeMultipart) bodyPart.getContent();
MimeTextContent body = getBodyFromMultipart(nestedMultipart, mimeTypes);
if (body != null) {
return body;
}
}
}
return null;
}
}
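The search order used by `parseBody` and `getBodyFromMultipart` above (check the current part, then descend into nested multiparts in order and return the first hit) can be illustrated without Jakarta Mail. The sketch below models a MIME tree with a hypothetical `Part` class and a hypothetical `findBody` method; neither is part of this PR or of Jakarta Mail. It compares MIME types by exact string equality, whereas `isMimeType` in the real code ignores parameters and case.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a MIME tree; not part of Jakarta Mail or of this PR.
final class BodySearchSketch {

    /** A leaf carries text; a multipart node carries child parts. */
    static final class Part {
        final String mimeType;
        final String text;          // null for multipart nodes
        final List<Part> children;  // empty for leaves

        private Part(String mimeType, String text, List<Part> children) {
            this.mimeType = mimeType;
            this.text = text;
            this.children = children;
        }

        static Part leaf(String mimeType, String text) {
            return new Part(mimeType, text, Collections.<Part>emptyList());
        }

        static Part multipart(String subtype, Part... children) {
            return new Part("multipart/" + subtype, null, Arrays.asList(children));
        }
    }

    /**
     * Depth-first, pre-order search: returns the text of the first part whose
     * MIME type is in wantedTypes, or null if no part matches.
     */
    static String findBody(Part part, List<String> wantedTypes) {
        if (wantedTypes.contains(part.mimeType)) {
            return part.text;
        }
        for (Part child : part.children) {
            String body = findBody(child, wantedTypes);
            if (body != null) {
                return body;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // multipart/mixed containing a multipart/alternative and an attachment.
        Part message = Part.multipart("mixed",
                Part.multipart("alternative",
                        Part.leaf("text/plain", "This is a test email"),
                        Part.leaf("text/html", "<html><body>This is a <b>test email</b></body></html>")),
                Part.leaf("application/pdf", "(binary attachment)"));

        System.out.println(findBody(message, Collections.singletonList("text/html")));
        System.out.println(findBody(message, Arrays.asList("text", "text/plain", "text/ascii")));
    }
}
```

Because the search returns on the first match, the order of parts in the message decides which body wins when several parts share a wanted type.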
@@ -1,4 +1,4 @@
package com.audriga.jakarta.sml.model;
package com.audriga.jakarta.sml.h2lj.model;

public enum StructuredContext {
SCHEMA_ORG("https://schema.org");
@@ -1,4 +1,4 @@
package com.audriga.jakarta.sml.model;
package com.audriga.jakarta.sml.h2lj.model;

import org.json.JSONObject;
import org.jspecify.annotations.NonNull;
@@ -1,4 +1,4 @@
package com.audriga.jakarta.sml.model;
package com.audriga.jakarta.sml.h2lj.model;

public enum StructuredSyntax {
JSON_LD,