
Creating an Importer

An importer is the component of your plugin that transfers data from your source into the Newsteam platform. It implements the wirebucket interface, which defines the methods for importing, processing, and integrating data. Implementing these methods correctly ensures smooth interaction with the Newsteam system and makes your plugin a reliable data source within the platform.

The importer interacts with various components, including the GetEnv, GetLogfiles, and ProcessLogfile functions, each designed to handle specific tasks like data retrieval and transformation. This section will guide you through implementing these functions and integrating your data source with Newsteam.

Your importer must implement the following functions, defined in the wirebucket interface, to integrate properly with Newsteam:

  • GetEnv: Advertises the capabilities of your plugin.
  • GetLogfiles: Retrieves raw data from your data source.
  • ProcessLogfile: Processes raw log data into structured articles.

Each function plays a unique role in the importer’s operation, from initialization and metadata advertisement to data retrieval and transformation.

The GetEnv function is the entry point for informing Newsteam about the capabilities of your plugin. It returns a structured GetEnvResponse detailing the features your importer supports and optionally provides metadata about publications your plugin can integrate with.

The primary purpose of GetEnv is to:

  1. Advertise Plugin Capabilities: Declare which types of data your importer can handle (e.g., articles, images, videos).
  2. List Publications (Optional): Provide a list of publications supported by your plugin, including their names, IDs, and additional metadata like logos and menu items.

The function must return an instance of GetEnvResponse, which includes:

  • Capabilities (WireCapabilities): A set of boolean flags indicating supported features.
  • Publications: An optional array of publication metadata.

Below are example implementations:

public Task<GetEnvResponse> GetEnv(GetEnvRequest request)
{
    return Task.FromResult(new GetEnvResponse
    {
        Capabilities = new WireCapabilities
        {
            Article = true,         // Indicates that the importer supports articles.
            Image = false,          // Does not support fetching images.
            Video = false,          // Does not support fetching videos.
            Audio = false,          // Does not support fetching audio.
            Authentication = false  // Does not support authentication integration.
        },
        Publications = new List<Publication>
        {
            new Publication
            {
                Id = "pub-001",
                Name = "Tech Times",
                Description = "Your go-to source for technology news.",
                Font = "Arial",
                Colors = new List<int> { 0x123456, 0x654321 }
            }
        }
    });
}
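For plugins written in TypeScript, an equivalent sketch might look like the following. The type shapes here mirror the C# example above; the exact TypeScript bindings for the wirebucket interface are an assumption.

```typescript
// Sketch of GetEnv in TypeScript, mirroring the C# example.
// The interface shapes below are assumed bindings, not the official API.
interface WireCapabilities {
  article: boolean;
  image: boolean;
  video: boolean;
  audio: boolean;
  authentication: boolean;
}

interface Publication {
  id: string;
  name: string;
  description?: string;
  font?: string;
  colors?: number[];
}

interface GetEnvResponse {
  capabilities: WireCapabilities;
  publications?: Publication[];
}

async function getEnv(): Promise<GetEnvResponse> {
  return {
    capabilities: {
      article: true,         // the importer supports articles
      image: false,          // no image fetching
      video: false,          // no video fetching
      audio: false,          // no audio fetching
      authentication: false, // no authentication integration
    },
    publications: [
      {
        id: "pub-001",
        name: "Tech Times",
        description: "Your go-to source for technology news.",
        font: "Arial",
        colors: [0x123456, 0x654321],
      },
    ],
  };
}
```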

The GetLogfiles function is responsible for retrieving raw log files from your data source and returning them as an array of binary data.

The primary responsibilities of the GetLogfiles function are:

  1. Data Retrieval: Fetch log files from a specified location in the data source. These files are typically binary and must be returned as an array.
  2. State Management: Use a Cursor object to indicate the current position in the data source, enabling efficient and resumable data fetching.

Its input and output are:

  • Input: A Cursor object that specifies the starting position for fetching log files. The cursor contains metadata such as the source ID, position, and additional state information.
  • Output: An array of log files represented as binary data.

The Cursor object helps manage the state of log file retrieval and provides metadata such as:

  • id: Unique identifier for the cursor.
  • bucketId: Specifies the data source or bucket.
  • seekDate / seekPos: Indicates where to start fetching logs.
  • status / error: Tracks the cursor’s operational status.

This allows the GetLogfiles function to resume from a specific point in case of failure or pagination.
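Based on the fields listed above, a Cursor could be modeled as follows. This is a sketch; the field types and the set of status values are assumptions, so consult the wirebucket interface definition for the authoritative shape.

```typescript
// Sketch of a Cursor based on the fields described above.
// Types and status values are assumptions, not the official definition.
interface Cursor {
  id: string;        // unique identifier for the cursor
  bucketId: string;  // the data source (bucket) this cursor reads from
  seekDate?: number; // Unix timestamp to resume from, if date-based
  seekPos?: number;  // record or byte offset to resume from, if position-based
  status: "idle" | "running" | "failed"; // operational status (assumed values)
  error?: string;    // last error message, if any
}

// Example: a cursor that resumes from record offset 128 in bucket "pub-001".
const exampleCursor: Cursor = {
  id: "cursor-42",
  bucketId: "pub-001",
  seekPos: 128,
  status: "idle",
};
```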

The GetLogfiles function typically follows these steps:

  1. Validate the Cursor: Ensure that the provided cursor is valid and points to a valid data source location.
  2. Fetch Data: Retrieve log files starting from the position indicated by the cursor. This may involve reading from a database, a file system, or an API.
  3. Return Results: Return the fetched log files as an array of binary data.

For example, an implementation might:

  • Simulate fetching two log files:
    • Log file 1: Contains binary data [0x01, 0x02, 0x03].
    • Log file 2: Contains binary data [0x04, 0x05, 0x06].
  • Return these as part of the response.

The function might return an array of log files encoded as binary data:

[
  [1, 2, 3],
  [4, 5, 6]
]
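The validate/fetch/return steps above can be sketched in TypeScript as follows. The fetchFromSource helper and the minimal Cursor shape are hypothetical stand-ins for your actual data-access layer; only the fields used here are included.

```typescript
// Hypothetical sketch of GetLogfiles. The Cursor shape is trimmed to the
// fields used here, and fetchFromSource stands in for reading from a
// database, file system, or API.
interface Cursor {
  id: string;
  bucketId: string;
  seekPos?: number; // record offset to resume from
}

// Stand-in data-access helper: returns the two simulated log files
// from the example above, regardless of offset.
function fetchFromSource(bucketId: string, offset: number): Uint8Array[] {
  return [
    new Uint8Array([0x01, 0x02, 0x03]),
    new Uint8Array([0x04, 0x05, 0x06]),
  ];
}

function getLogfiles(cursor: Cursor): Uint8Array[] {
  // 1. Validate the cursor before touching the data source.
  if (!cursor.bucketId) {
    throw new Error("cursor is missing a bucketId");
  }
  // 2. Fetch data starting at the cursor position (default: the beginning).
  const logfiles = fetchFromSource(cursor.bucketId, cursor.seekPos ?? 0);
  // 3. Return the fetched log files as an array of binary data.
  return logfiles;
}
```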

The ProcessLogfile function is responsible for processing a single log file, extracting meaningful information, and transforming it into a structured format suitable for further use. This function plays a pivotal role in converting raw data into articles.


The primary responsibilities of the ProcessLogfile function are:

  1. Log File Parsing: Analyze and extract relevant information from the provided log file content.
  2. Data Transformation: Convert the raw log file data into structured articles.
  3. Integration with Buckets: Use the associated bucket information to contextualize the transformation process.

  • Input:
    1. Bucket: A metadata object representing the context or source of the log file. It typically includes information about the data source, such as its identifier or configuration.
    2. Content: The binary data (Uint8Array or equivalent) representing the raw log file to be processed.

  • Output: An array of structured Article objects or similar entities. Each Article represents a meaningful unit of information extracted from the log file.

The Article object is the structured output of the processing step. It contains the following key fields:

  • Identifiers:
    • id: Unique identifier for the article.
    • organizationId and shareId: Contextual identifiers for the organization or sharing scope.

  • Content Details:
    • Titles (title, title2, etc.): Multiple title options for the article.
    • summary and plainText: Summary and main content of the article.
    • keywords, tags, and groups: Metadata for categorization and searchability.

  • Additional Metadata:
    • created, modified, and published: Timestamps for lifecycle tracking.
    • status: Current state of the article (e.g., draft, published).
    • authors, creatorIds, and assigned: Information about contributors.

  • Advanced Features:
    • canonicalUrl: Canonical link for the article.
    • sections and relatedArticles: Structural relationships with other entities.
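The key fields above can be summarized as a TypeScript interface. This sketch covers only the fields listed in this section, and the types are assumptions; the full Article structure contains more fields.

```typescript
// Sketch of the key Article fields described above.
// Types are assumptions; the full structure has additional fields.
interface Article {
  // Identifiers
  id: string;
  organizationId?: string;
  shareId?: string;
  // Content details
  title: string;
  title2?: string;
  summary?: string;
  plainText?: string;
  keywords?: string[];
  tags?: string[];
  groups?: string[];
  // Additional metadata
  created?: number;   // Unix timestamp
  modified?: number;  // Unix timestamp
  published?: number; // Unix timestamp
  status?: string;    // e.g. "draft", "published"
  authors?: string[];
  // Advanced features
  canonicalUrl?: string;
}

// Example article matching the sample output later in this section.
const exampleArticle: Article = {
  id: "article-1",
  title: "First Article Title",
  summary: "Summary of the first article.",
  created: 1672531200,
  tags: ["news", "feature"],
  status: "published",
};
```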

The typical workflow of the ProcessLogfile function is as follows:

  1. Parse Log File:
    • Extract raw data fields based on the log file structure.
    • Perform any necessary decoding or deserialization.

  2. Transform Data:
    • Map extracted data fields into the Article structure.
    • Populate metadata fields such as title, tags, and status.

  3. Return Results:
    • Generate and return a collection of Article objects.
    • Handle empty or invalid log files gracefully by returning an empty array.
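The parse/transform/return workflow can be sketched as follows. This example assumes, purely for illustration, that each log file is UTF-8 JSON containing an array of raw records; the real log format depends entirely on your data source, and the Bucket and record shapes here are hypothetical.

```typescript
// Hypothetical sketch of ProcessLogfile. Assumes the log file is UTF-8
// JSON: an array of { id, title, body } records. Bucket and Article are
// trimmed to the fields used here.
interface Bucket {
  id: string;
}

interface Article {
  id: string;
  title: string;
  plainText: string;
  status: string;
}

function processLogfile(bucket: Bucket, content: Uint8Array): Article[] {
  // 1. Parse: decode the binary content and deserialize it.
  let records: { id: string; title: string; body: string }[];
  try {
    const text = new TextDecoder().decode(content);
    records = JSON.parse(text);
  } catch {
    // Handle empty or invalid log files gracefully.
    return [];
  }
  if (!Array.isArray(records)) return [];
  // 2. Transform: map each raw record into the Article structure,
  //    using the bucket to contextualize the identifiers.
  return records.map((r) => ({
    id: `${bucket.id}-${r.id}`,
    title: r.title,
    plainText: r.body,
    status: "draft",
  }));
}
```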

If a log file contains information for two articles, the function may produce the following output:

[
  {
    "id": "article-1",
    "title": "First Article Title",
    "summary": "Summary of the first article.",
    "plainText": "Full content of the first article.",
    "created": 1672531200,
    "tags": ["news", "feature"],
    "status": "published"
  },
  {
    "id": "article-2",
    "title": "Second Article Title",
    "summary": "Summary of the second article.",
    "plainText": "Full content of the second article.",
    "created": 1672531300,
    "tags": ["editorial"],
    "status": "draft"
  }
]