kodexa

The Kodexa Python client allows you to work with unstructured documents and the Kodexa platform to enable Intelligent Document Automation.

Subpackages

Package Contents

Classes

Assistant

An assistant is a rich API that allows you to work with a reactive content store or with an end user who is working with a set of content

AssistantContext

The Assistant Context provides a way to interact with additional services and capabilities

AssistantResponse

An assistant response allows you to provide the response from an assistant to a specific event.

FileHandleConnector

FolderConnector

UrlConnector

ContentEvent

ContentFeature

A feature allows you to capture almost any additional data or metadata and associate it with a ContentNode

ContentNode

A Content Node identifies a section of the document containing a logical grouping of information.

Document

A Document is a collection of metadata and a set of content nodes.

DocumentActor

Provides the definition of an actor in a transition

DocumentMetadata

A flexible dict based approach to capturing metadata for the document

DocumentTransition

Provides the definition of a transition for a document, where a change was applied by an assistant, user or external process

SourceMetadata

Class for keeping track of the original source information for a

TransitionType

The type of transition

Taxonomy

Provides the taxonomy hierarchy that is used for content and document classification and labeling

Pipeline

A pipeline represents a way to bring together parts of the kodexa framework to solve a specific problem.

PipelineContext

Pipeline context is created when you create a pipeline and it provides a way to access information about the

PipelineStatistics

A set of statistics for the processed document

KodexaPlatform

The KodexaPlatform object allows you to work with an instance of the Kodexa platform, allowing you to list, view and deploy

RemoteStep

Allows you to interact with a step that has been deployed in the Kodexa platform

RemotePipeline

Allows you to interact with a pipeline that has been deployed to an instance of the Kodexa platform

RemoteSession

A Session on the Kodexa platform for leveraging pipelines and services

KodexaClient

NodeTagCopy

The NodeTagCopy action allows you to select nodes specified by the selector and create copies of the existing_tag (if it exists) with the new_tag_name.

NodeTagger

A node tagger allows you to provide a type and content regular expression and then

RollupTransformer

The rollup step allows you to decide how you want to collapse content in a document by removing nodes

TextParser

Parser to load a source file as a text document. The text from the document may be placed on the root ContentNode or on the root's child nodes (controlled by lines_as_child_nodes).

Functions

add_connector(connector)

get_connector(connector, source)

get_connectors()

get_source(document)

Attributes

registered_connectors

class kodexa.Assistant

An assistant is a rich API that allows you to work with a reactive content store or with an end user who is working with a set of content

process_event(event: kodexa.model.objects.BaseEvent, context: AssistantContext) AssistantResponse

The assistant will need to examine the event to determine if it wants to respond

The event will focus on a specific content object (that will be stored and available). Based on the metadata from the content object the assistant can then return a response which can include zero or more pipelines that it wishes to execute.

These pipelines will be run asynchronously, and their results may come back to the assistant as another event

Parameters
  • event – BaseEvent: the event being provided to the assistant

  • context – AssistantContext: the context for the assistant

Returns

the response to the event

Return type

AssistantResponse
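
The event-handling contract described above can be sketched without the platform. The classes below (FakeEvent, FakeResponse, InvoiceAssistant) are hypothetical stand-ins, not the kodexa types, and pipelines are represented as plain strings purely for illustration. The sketch only shows the idea of examining an event and returning zero or more pipelines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeEvent:
    """Hypothetical stand-in for an event that carries content metadata."""
    event_type: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a response listing pipelines to run."""
    pipelines: List[str] = field(default_factory=list)
    text: Optional[str] = None


class InvoiceAssistant:
    """Responds only to content events whose metadata marks them as invoices."""

    def process_event(self, event: FakeEvent) -> FakeResponse:
        # Events this assistant does not care about get an empty response.
        if event.event_type != "content":
            return FakeResponse()
        if event.metadata.get("kind") == "invoice":
            # One pipeline is requested; a real assistant could return several.
            return FakeResponse(pipelines=["extract-invoice"], text="Processing invoice")
        return FakeResponse()


assistant = InvoiceAssistant()
handled = assistant.process_event(FakeEvent("content", {"kind": "invoice"}))
ignored = assistant.process_event(FakeEvent("content", {"kind": "memo"}))
```

Here `handled.pipelines` holds one entry and `ignored.pipelines` is empty, which mirrors the "zero or more pipelines" wording.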

class kodexa.AssistantContext(metadata: AssistantMetadata, path_to_kodexa_metadata: str = 'kodexa.yml', stores=None, content_provider=None, extension_pack_util=None)

The Assistant Context provides a way to interact with additional services and capabilities while processing an event

get_content(content_object: kodexa.model.ContentObject)

Gets the content for a content object using the content provider

Parameters

content_object – The content object whose content should be retrieved

put_content(content_object: kodexa.model.ContentObject, content)

Puts the content object and its content through the content provider

Parameters
  • content_object – The content object

  • content – the content

get_step(step: str, options=None)

Returns an instance of a step that is packaged in the same extension pack as the assistant. This allows you to build pipelines when you don’t know the owning organization.

Parameters
  • step (str) – The step name (e.g. pdf-parser)

  • options – A dictionary of the options to create the step

Returns

The step

get_store(event: ContentEvent) kodexa.platform.client.DocumentStoreEndpoint

Get a document store for the event (based on the document family ID)

Parameters

event – ContentEvent:

Returns

The instance of the document store

class kodexa.AssistantResponse(pipelines: List[AssistantPipeline] = None, text: Optional[str] = None, available_intents=None, output_document: kodexa.model.Document = None)

An assistant response allows you to provide the response from an assistant to a specific event.

pipelines

The list of pipelines that you wish to have executed against the content object from the event

text

The text that will be provided back to the user from the assistant

available_intents

Any available intents that the assistant will further respond to

output_document

The output document, if the assistant has directly created one

class kodexa.FileHandleConnector(original_path)
static get_name()
static get_source(document)
Parameters

document

Returns:

__iter__()
__next__()
class kodexa.FolderConnector(path, file_filter='*', recursive=False, relative=False, caller_path=get_caller_dir(), unpack=False)
static get_name()
static get_source(document)
Parameters

document

Returns:

__iter__()
__next__()
__get_files__()
class kodexa.UrlConnector(original_path, headers=None)
static get_name()
static get_source(document)
Parameters

document

Returns:

__iter__()
__next__()
kodexa.add_connector(connector)
kodexa.get_connector(connector: str, source: kodexa.model.SourceMetadata)
kodexa.get_connectors()

kodexa.get_source(document)
kodexa.registered_connectors :Dict[str, Type]
class kodexa.ContentEvent

Bases: kodexa.model.base.KodexaBaseModel

type :Optional[str]
content_object :Optional[ContentObject]
document_family :Optional[DocumentFamily]
object_event_type :Optional[ObjectEventType]
class kodexa.ContentFeature(feature_type: str, name: str, value: Any, single: bool = True)

Bases: object

A feature allows you to capture almost any additional data or metadata and associate it with a ContentNode

feature_type :str

The type of feature, a logical name to group feature types together (e.g. spatial)

name :str

The name of the feature (e.g. bbox)

value :Any

The value of the feature

single :bool

Determines whether the data for this feature is a single instance or an array. If you add the same feature to the same node more than once, the content feature will hold multiple data elements and the single flag will be False.

__str__()

Return str(self).

to_dict()

Create a dictionary representing this ContentFeature’s structure and content.

Returns

The properties of this ContentFeature structured as a dictionary.

Return type

dict

>>> node.to_dict()
get_value()

Get the value from the feature. This method will handle the single flag

Returns

The value of the feature
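
How get_value might handle the single flag can be sketched as follows. FeatureSketch is a hypothetical stand-in, not the kodexa class, and it assumes the values are kept in a list internally; that storage detail is an assumption made for illustration.

```python
from typing import Any, List


class FeatureSketch:
    """Hypothetical stand-in; assumes values are kept in a list internally."""

    def __init__(self, feature_type: str, name: str, value: Any, single: bool = True):
        self.feature_type = feature_type
        self.name = name
        self.single = single
        # A single value is wrapped; a collection is copied as-is.
        self.value: List[Any] = [value] if single else list(value)

    def get_value(self) -> Any:
        # A single-valued feature unwraps to its only element.
        return self.value[0] if self.single else self.value


one = FeatureSketch("spatial", "bbox", [10, 20, 50, 100])
many = FeatureSketch("tag", "label", ["a", "b"], single=False)
```

With this layout the caller never needs to inspect the flag: `one.get_value()` gives back the bounding box itself, while `many.get_value()` gives back the whole list.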

class kodexa.ContentNode(document, node_type: str, content: Optional[str] = None, content_parts: Optional[List[Any]] = None, parent=None, index: Optional[int] = None, virtual: bool = False)

Bases: object

A Content Node identifies a section of the document containing a logical grouping of information.

The node will have content and can include any number of features.

You should always create a node using the Document’s create_node method to ensure that the correct mixins are applied.

>>> new_page = document.create_node(node_type='page')
>>> current_content_node.add_child(new_page)
>>> new_page = document.create_node(node_type='page', content='This is page 1')
>>> current_content_node.add_child(new_page)
node_type :str

The node type (e.g. line, page, cell)

document :Document

The document that the node belongs to

_content_parts :Optional[List[Any]]

The children of the content node

index :Optional[int]

The index of the content node

uuid :Optional[int]

The ID of the content node

virtual :bool

Is the node virtual (i.e. it doesn’t actually exist in the document)

get_content_parts()
set_content_parts(content_parts)
property content
__eq__(other)

Return self==value.

get_parent()
__str__()

Return str(self).

to_json()

Create a JSON string representation of this ContentNode.

Returns

The JSON formatted string representation of this ContentNode.

Return type

str

>>> node.to_json()
to_dict()

Create a dictionary representing this ContentNode’s structure and content.

Returns

The properties of this ContentNode and all of its children structured as a dictionary.

Return type

dict

>>> node.to_dict()
static from_dict(document, content_node_dict: addict.Dict, parent=None)

Build a new ContentNode from a dictionary representation.

Parameters
  • document (Document) – The Kodexa document from which the new ContentNode will be created (not added).

  • content_node_dict (Dict) – The dictionary-structured representation of a ContentNode. This value will be unpacked into a ContentNode.

  • parent (Optional[ContentNode]) – Optionally the parent content node

Returns

A ContentNode containing the unpacked values from the content_node_dict parameter.

Return type

ContentNode

>>> ContentNode.from_dict(document, content_node_dict)
add_child_content(node_type: str, content: str, index: Optional[int] = None) ContentNode

Convenience method that lets you quickly add a child node with a type and content

Parameters
  • node_type – the node type

  • content – the content

  • index – the index (optional) (Default value = None)

Returns

the new ContentNode

add_child(child, index: Optional[int] = None)

Add a ContentNode as a child of this ContentNode

Parameters
  • child (ContentNode) – The node that will be added as a child of this node

  • index (Optional[int]) – The index at which this child node should be added; defaults to None. If None, index is set as the count of child node elements.

>>> new_page = document.create_node(node_type='page')
>>> current_content_node.add_child(new_page)
remove_child(content_node)
get_children()

Returns a list of the children of this node.

Returns

The list of child nodes for this ContentNode.

Return type

list[ContentNode]

>>> node.get_children()
set_feature(feature_type, name, value)

Sets a feature for this ContentNode, replacing the value if a feature by this type and name already exists.

Parameters
  • feature_type (str) – The type of feature to be added to the node.

  • name (str) – The name of the feature.

  • value (Any) – The value of the feature.

Returns

The feature that was added to this ContentNode

Return type

ContentFeature

>>> new_page = document.create_node(node_type='page')
>>> new_page.set_feature('pagination','pageNum',1)
add_feature(feature_type, name, value, single=True, serialized=False)

Add a new feature to this ContentNode.

Note: if a feature for this feature_type/name already exists, the new value will be added to the existing feature; therefore the feature value might become a list.

Parameters
  • feature_type (str) – The type of feature to be added to the node.

  • name (str) – The name of the feature.

  • value (Any) – The value of the feature.

  • single (boolean) – Indicates that the value is singular, rather than a collection (ex: str vs list); defaults to True.

  • serialized (boolean) – Indicates that the value is/is not already serialized; defaults to False.

Returns

The feature that was added to this ContentNode.

Return type

ContentFeature

>>> new_page = document.create_node(node_type='page')
>>> new_page.add_feature('pagination','pageNum',1)
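
The difference between set_feature (replace) and add_feature (accumulate) described above can be illustrated with a small stand-in. NodeSketch is hypothetical and is not the kodexa ContentNode; keying features by a (type, name) tuple and storing values in a list are assumptions made for this sketch.

```python
from typing import Any, Dict, List, Optional, Tuple


class NodeSketch:
    """Hypothetical stand-in for a node that stores features by (type, name)."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, str], List[Any]] = {}

    def set_feature(self, feature_type: str, name: str, value: Any) -> None:
        # Replace whatever was stored under this type and name.
        self._features[(feature_type, name)] = [value]

    def add_feature(self, feature_type: str, name: str, value: Any) -> None:
        # Append to any existing values, so the feature may become multi-valued.
        self._features.setdefault((feature_type, name), []).append(value)

    def get_feature_value(self, feature_type: str, name: str) -> Optional[Any]:
        # Only the first value is returned, or None when the feature is absent.
        values = self._features.get((feature_type, name))
        return values[0] if values else None

    def get_feature_values(self, feature_type: str, name: str) -> Optional[List[Any]]:
        # All values, or None when the feature is absent.
        return self._features.get((feature_type, name))


page = NodeSketch()
page.add_feature("pagination", "pageNum", 1)
page.add_feature("pagination", "pageNum", 2)
values_after_add = page.get_feature_values("pagination", "pageNum")
page.set_feature("pagination", "pageNum", 7)
values_after_set = page.get_feature_values("pagination", "pageNum")
```

Adding twice leaves two values under the same type and name, while setting afterwards leaves exactly one.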
delete_children(nodes: Optional[List] = None, exclude_nodes: Optional[List] = None)
Delete the children of this node. You can supply either a list of the nodes to delete or a list of the nodes to exclude from the delete; if neither is supplied, all the children are deleted.

Note that there is an order of precedence: if you provide a list of nodes to delete, the list of nodes to exclude is ignored.

Parameters
  • nodes – Optional[List[ContentNode]]: a list of child content nodes to delete (Default value = None)

  • exclude_nodes – Optional[List[ContentNode]]: a list of child content nodes not to delete (Default value = None)
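
The precedence rule for delete_children can be written out as a small function. This is a sketch of the documented rule only, not the library's implementation; children_after_delete is a hypothetical helper and plain strings stand in for content nodes.

```python
from typing import List, Optional


def children_after_delete(children: List[str],
                          nodes: Optional[List[str]] = None,
                          exclude_nodes: Optional[List[str]] = None) -> List[str]:
    """Return the children that survive; strings stand in for content nodes."""
    if nodes is not None:
        # An explicit delete list takes precedence; exclude_nodes is ignored.
        return [child for child in children if child not in nodes]
    if exclude_nodes is not None:
        # Delete everything except the excluded children.
        return [child for child in children if child in exclude_nodes]
    # Neither list supplied: delete all children.
    return []


both = children_after_delete(["a", "b", "c"], nodes=["a"], exclude_nodes=["a"])
only_exclude = children_after_delete(["a", "b", "c"], exclude_nodes=["a"])
neither = children_after_delete(["a", "b", "c"])
```

When both lists are passed, only the delete list has any effect, so `both` keeps "b" and "c" even though "a" was also listed for exclusion.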

get_feature(feature_type, name)

Gets the value for the given feature.

Parameters
  • feature_type (str) – The type of the feature.

  • name (str) – The name of the feature.

Returns

The feature with the specified type and name. If no feature is found, None is returned. Note that if there is more than one instance of the feature, only the first one is returned.

Return type

ContentFeature or None

>>> new_page.get_feature('pagination','pageNum')
   1
get_features_of_type(feature_type)

Get all features of a specific type.

Parameters

feature_type (str) – The type of the feature.

Returns

A list of features with the specified type. If no features are found, an empty list is returned.

Return type

list[ContentFeature]

>>> new_page.get_features_of_type('my_type')
   []
has_feature(feature_type: str, name: str)

Determines if a feature with the given type and name exists on this content node.

Parameters
  • feature_type (str) – The type of the feature.

  • name (str) – The name of the feature.

Returns

True if the feature is present; else, False.

Return type

bool

>>> new_page.has_feature('pagination','pageNum')
   True
get_features()

Get all features on this ContentNode.

Returns

A list of the features on this ContentNode.

Return type

list[ContentFeature]

remove_feature(feature_type: str, name: str, include_children: bool = False)

Removes the feature with the given name and type from this node.

Parameters
  • feature_type (str) – The type of the feature.

  • name (str) – The name of the feature.

  • include_children (bool) – also remove the feature from this node’s children

>>> new_page.remove_feature('pagination','pageNum')
get_feature_value(feature_type: str, name: str) Optional[Any]

Get the value for a feature with the given name and type on this ContentNode.

Parameters
  • feature_type (str) – The type of the feature.

  • name (str) – The name of the feature.

Returns

The value of the feature if it exists on this ContentNode; otherwise, None. Note that this only returns the first value (check single to determine whether there are multiple).

Return type

Any or None

>>> new_page.get_feature_value('pagination','pageNum')
   1
get_feature_values(feature_type: str, name: str) Optional[List[Any]]

Get all the values for a feature with the given name and type on this ContentNode.

Parameters
  • feature_type (str) – The type of the feature.

  • name (str) – The name of the feature.

Returns

The list of feature values or None if there is no feature

>>> new_page.get_feature_values('pagination','pageNum')
   [1]
get_content()

Get the content of this node.

Returns

The content of this ContentNode.

Return type

str

>>> new_page.get_content()
   "This is page one"
get_node_type()

Get the type of this node.

Returns

The type of this ContentNode.

Return type

str

>>> new_page.get_node_type()
   "page"
select_first(selector, variables=None)

Select and return the first child of this node that matches the selector value.

Parameters
  • selector (str) – The selector (e.g. //*)

  • variables (dict, optional) – A dictionary of variable name/value to use in substitution; defaults to None. Dictionary keys should match a variable specified in the selector.

Returns

The first matching node or none

Return type

Optional[ContentNode]

>>> document.get_root().select_first('.')
   ContentNode
>>> document.get_root().select_first('//*[hasTag($tagName)]', {"tagName": "div"})
   ContentNode
select(selector, variables=None)

Select and return the child nodes of this node that match the selector value.

Parameters
  • selector (str) – The selector (e.g. //*)

  • variables (dict, optional) – A dictionary of variable name/value to use in substitution; defaults to None. Dictionary keys should match a variable specified in the selector.

Returns

A list of the matching content nodes. If no matches are found, the list will be empty.

Return type

list[ContentNode]

>>> document.get_root().select('.')
   [ContentNode]
>>> document.get_root().select('//*[hasTag($tagName)]', {"tagName": "div"})
   [ContentNode]
get_all_content(separator=' ', strip=True)

Get this node’s content, concatenated with all of its children’s content.

Parameters
  • separator (str, optional) – The separator to use in joining content together; defaults to ' '.

  • strip (boolean, optional) – Strip the result

Returns

The complete content for this node concatenated with the content of all child nodes.

Return type

str

>>> document.content_node.get_all_content()
   "This string is made up of multiple nodes"
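
The concatenation that get_all_content describes can be sketched as a depth-first walk. TextNode and all_content are hypothetical stand-ins rather than kodexa code, and the ordering (a node's own content first, then its children in order) is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextNode:
    """Hypothetical stand-in for a content node with optional content."""
    content: Optional[str] = None
    children: List["TextNode"] = field(default_factory=list)


def all_content(node: TextNode, separator: str = " ", strip: bool = True) -> str:
    # Depth-first: this node's content first, then each child's content in order.
    parts = [node.content] if node.content else []
    parts.extend(all_content(child, separator, strip=False) for child in node.children)
    # Empty pieces are skipped so they do not produce doubled separators.
    joined = separator.join(part for part in parts if part)
    return joined.strip() if strip else joined


line = TextNode(children=[TextNode("This string"), TextNode("is made up of"),
                          TextNode("multiple nodes")])
joined_text = all_content(line)
```

Passing a different separator, for example `all_content(line, separator="|")`, changes only how the pieces are joined.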

adopt_children(nodes_to_adopt, replace=False)

This will take a list of content nodes and adopt them under this node, ensuring they are re-parented.

Parameters
  • nodes_to_adopt (List[ContentNode]) – A list of ContentNodes that will be added to the end of this node’s children collection

  • replace (bool) – If True, will remove all current children and replace them with the new list; defaults to False

>>> # select all nodes of type 'line', then the root node 'adopts' them
>>> # and replaces all its existing children with these 'line' nodes.
>>> document.get_root().adopt_children(document.select('//line'), replace=True)
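
Re-parenting, as adopt_children describes it, means each adopted node leaves its old parent before joining the new one. The TreeNode class below is a hypothetical stand-in written to show that bookkeeping; it is a sketch under that assumption and not the kodexa implementation.

```python
from typing import List, Optional


class TreeNode:
    """Hypothetical stand-in for a content node with a parent link."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["TreeNode"] = None
        self.children: List["TreeNode"] = []

    def add_child(self, child: "TreeNode") -> None:
        child.parent = self
        self.children.append(child)

    def adopt_children(self, nodes_to_adopt: List["TreeNode"], replace: bool = False) -> None:
        adopted = list(nodes_to_adopt)  # copy, so removals below cannot disturb iteration
        if replace:
            # Detach the current children that are not being adopted.
            for child in self.children:
                if child not in adopted:
                    child.parent = None
            self.children = []
        for node in adopted:
            # Remove the node from its old parent before re-parenting it.
            if node.parent is not None and node in node.parent.children:
                node.parent.children.remove(node)
            self.add_child(node)


root = TreeNode("root")
page = TreeNode("page")
root.add_child(page)
for line_name in ("line1", "line2"):
    page.add_child(TreeNode(line_name))

root.adopt_children(page.children, replace=True)
adopted_names = [child.name for child in root.children]
```

After the call the two line nodes hang directly off the root, and the page node is left detached with no children.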
remove_tag(tag_name)

Remove a tag from this content node.

Parameters

tag_name (str) – The name of the tag that should be removed.

>>> document.get_root().remove_tag('foo')
set_statistics(statistics)

Set the spatial statistics for this node

Parameters

statistics – the statistics object

>>> document.select('//page')[0].set_statistics(NodeStatistics())
get_statistics()

Get the spatial statistics for this node

Returns

the statistics object (or None if not set)

>>> document.select('//page')[0].get_statistics()
    <kodexa.spatial.NodeStatistics object at 0x7f80605e53c8>
set_bbox(bbox)

Set the bounding box for the node, this is structured as:

[x1,y1,x2,y2]

Parameters

bbox – the bounding box array

>>> document.select('//page')[0].set_bbox([10,20,50,100])
get_bbox()

Get the bounding box for the node, this is structured as:

[x1,y1,x2,y2]

Returns

the bounding box array

>>> document.select('//page')[0].get_bbox()
    [10,20,50,100]
set_bbox_from_children()

Set the bounding box for this node based on its children
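
One plausible way to derive a parent box from its children is to take the smallest box that encloses them all. The sketch below assumes that reading, along with the [x1,y1,x2,y2] layout given above where x1 <= x2 and y1 <= y2; union_bbox is a hypothetical helper, and the library may compute the box differently.

```python
from typing import List, Sequence


def union_bbox(child_boxes: Sequence[Sequence[float]]) -> List[float]:
    """Smallest [x1, y1, x2, y2] box that encloses every child box."""
    if not child_boxes:
        raise ValueError("at least one child bounding box is required")
    return [
        min(box[0] for box in child_boxes),  # left-most x1
        min(box[1] for box in child_boxes),  # lowest y1
        max(box[2] for box in child_boxes),  # right-most x2
        max(box[3] for box in child_boxes),  # highest y2
    ]


page_box = union_bbox([[10, 20, 50, 100], [5, 40, 30, 120]])
```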

set_rotate(rotate)

Set the rotate of the node

Parameters

rotate – the rotation of the node

>>> document.select('//page')[0].set_rotate(90)
get_rotate()

Get the rotate of the node

Returns

the rotation of the node

>>> document.select('//page')[0].get_rotate()
    90
get_x()

Get the X position of the node

Returns

the X position of the node

>>> document.select('//page')[0].get_x()
    10
get_y()

Get the Y position of the node

Returns

the Y position of the node

>>> document.select('//page')[0].get_y()
    90
get_width()

Get the width of the node

Returns

the width of the node

>>> document.select('//page')[0].get_width()
    70
get_height()

Get the height of the node

Returns

the height of the node

>>> document.select('//page')[0].get_height()
    40
copy_tag(selector='.', existing_tag_name=None, new_tag_name=None)

Creates a new tag of ‘new_tag_name’ on the selected content node(s) with the same information as the tag with ‘existing_tag_name’. Both existing_tag_name and new_tag_name values are required and must be different from one another. Otherwise, no action is taken. If a tag with the ‘existing_tag_name’ does not exist on a selected node, no action is taken for that node.

Parameters
  • selector – The selector to identify the source nodes to work on (default . - the current node)

  • existing_tag_name (str) – The name of the existing tag whose values will be copied to the new tag. (Default value = None)

  • new_tag_name (str) – The name of the new tag. This must be different from the existing_tag_name. (Default value = None)

>>> document.get_root().copy_tag(existing_tag_name='foo', new_tag_name='bar')
collect_nodes_to(end_node)

Get the sibling nodes between the current node and the end_node.

Parameters

end_node (ContentNode) – The node to end at

Returns

A list of sibling nodes between this node and the end_node.

Return type

list[ContentNode]

>>> document.content_node.get_children()[0].collect_nodes_to(end_node=document.content_node.get_children()[5])
tag_nodes_to(end_node, tag_to_apply, tag_uuid: str = None)

Tag all the nodes from this node to the end_node with the given tag name

Parameters
  • end_node (ContentNode) – The node to end with

  • tag_to_apply (str) – The tag name that will be applied to each node

  • tag_uuid (str) – The tag uuid used if you want to group them

>>> document.content_node.get_children()[0].tag_nodes_to(document.content_node.get_children()[5], tag_to_apply='foo')
tag_range(start_content_re, end_content_re, tag_to_apply, node_type_re='.*', use_all_content=False)

This will tag all the child nodes between the start and end content regular expressions

Parameters
  • start_content_re – The regular expression to match the starting child

  • end_content_re – The regular expression to match the ending child

  • tag_to_apply – The tag name that will be applied to the nodes in range

  • node_type_re – The node type to match (default is all)

  • use_all_content – Use full content (including child nodes, default is False)

Returns:

>>> document.content_node.tag_range(start_content_re='.*Cheese.*', end_content_re='.*Fish.*', tag_to_apply='foo')
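
The range selection behind tag_range can be sketched with plain strings standing in for child nodes. indexes_in_range is a hypothetical helper, and two details are assumptions of this sketch rather than documented behaviour: both the start and end children are included, and only the first matching range is returned.

```python
import re
from typing import List


def indexes_in_range(contents: List[str], start_content_re: str, end_content_re: str) -> List[int]:
    """Indexes of children from the first start match through the next end match."""
    selected: List[int] = []
    in_range = False
    for index, content in enumerate(contents):
        # The range opens at the first child whose content matches the start pattern.
        if not in_range and re.match(start_content_re, content):
            in_range = True
        if in_range:
            selected.append(index)
            # The range closes (inclusively) at the first child matching the end pattern.
            if re.match(end_content_re, content):
                break
    return selected


children = ["Bread", "Blue Cheese", "Ham", "Smoked Fish", "Eggs"]
tagged = indexes_in_range(children, ".*Cheese.*", ".*Fish.*")
```

In this example the children at indexes 1 to 3 would receive the tag, and "Bread" and "Eggs" are left untouched.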
tag(tag_to_apply, selector='.', content_re=None, use_all_content=False, node_only=None, fixed_position=None, data=None, separator=' ', tag_uuid: str = None, confidence=None, value=None, use_match=True, index=None, cell_index=None, group_uuid=None, parent_group_uuid=None, note=None, status=None)

This will tag (see Feature Tagging) the expression groups identified by the regular expression.

Note that if you use the flag use_all_content then node_only will default to True if not set, else it will default to False

Parameters
  • tag_to_apply – The name of tag that will be applied to the node

  • selector – The selector to identify the source nodes to work on (default . - the current node)

  • content_re – The regular expression that you wish to use to tag, note that we will create a tag for each matching group (Default value = None)

  • use_all_content – Apply the regular expression to the all_content (include content from child nodes) (Default value = False)

  • separator – Separator to use for use_all_content (Default value = ' ')

  • node_only – Ignore the matching groups and tag the whole node (Default value = None)

  • fixed_position – Use a fixed position, supplied as a tuple i.e. - (4,10) tag from position 4 to 10 (default None)

  • data – A dictionary of data for the given tag (Default value = None)

  • tag_uuid (str) – A UUID used to tie tags together in order to show that they are related and form a single concept. For example, if tagging the two words “Wells” and “Fargo” as an ORGANIZATION, the tag on both words should have the same tag_uuid in order to indicate they are both needed to form the single ORGANIZATION. If a tag_uuid is provided, it is used on all tags created in this method. This may result in multiple nodes or multiple feature values having the same tag_uuid. For example, if the selector provided results in more than one node being selected, each node would be tagged with the same tag_uuid. The same holds true if a content_re value is provided, node_only is set to False, and multiple matches are found for the content_re pattern. In that case, each feature value would share the same UUID. If no tag_uuid is provided, a new uuid is generated for each tag instance. (Default value = None)

  • confidence – The confidence in the tag (0-1)

  • value – The value you wish to store with the tag, this allows you to provide text that isn’t part of the content but represents the data you wish tagged

  • use_match – If True (default) we will use match for regex matching, if False we will use search

  • index – The index for the tag

  • cell_index – The cell index for the tag

  • group_uuid – The group uuid for the tag

  • parent_group_uuid – The parent group uuid for the tag

  • note – a text note for the tag

  • status – a status for the tag, this can be transitioned to an attribute status during extraction

>>> document.content_node.tag('is_cheese')
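
The use_match option refers to two different matching styles. Assuming it maps onto Python's re.match and re.search, which the parameter description suggests but does not state outright, the difference looks like this with the standard library alone:

```python
import re

content = "The Cheese Moved"
pattern = r"Cheese"

# re.match anchors at the start of the string, so this finds nothing.
anchored = re.match(pattern, content)

# re.search scans the whole string and finds the word in the middle.
scanned = re.search(pattern, content)
span = scanned.span() if scanned else None
```

With match-style matching, a pattern usually needs a leading `.*` to reach text that is not at the start of the content.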
get_tags()

Returns a list of the names of the tags on the given node

Returns

A list of the tag names

>>> document.content_node.select_first('*').get_tags()
    ['is_cheese']
get_tag_features()

Returns a list of the features that are tags on the given node

Returns

A list of the tag features

>>> document.content_node.select_first('*').get_tag_features()
    [ContentFeature()]
get_tag_values(tag_name, include_children=False)

Get the values for a specific tag name

Parameters
  • tag_name – tag name

  • include_children – include the children of this node (Default value = False)

Returns

a list of the tag values

Get the values for a specific tag name, grouped by uuid

Parameters
  • tag_name (str) – tag name

  • include_children (bool) – include the children of this node

  • value_separator (str) – the string to be used to join related tag values

Returns

a list of the tag values

Get the nodes for a specific tag name, grouped by uuid

Parameters
  • tag_name (str) – tag name

  • everywhere (bool) – include the children of this node

  • tag_uuid (optional(str)) – if set we will only get nodes related to this tag UUID

Returns

a dictionary that groups nodes by tag UUID

get_tag(tag_name, tag_uuid=None)

Returns the value of a tag (a dictionary). This can be either a single value in a list, [[start,end,value]], or, if multiple parts of the content of this node match, a list of lists, i.e. [[start1,end1,value1],[start2,end2,value2]]

Parameters
  • tag_name – The name of the tag

  • tag_uuid (Optional) – Optionally you can also provide the tag UUID

Returns

A list of the tagged locations and values for this label in this node

>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').get_tag('is_cheese')
    [0,10,'The Cheese Moved']
get_all_tags()

Get the names of all tags that have been applied to this node or to its children.

Returns

A list of the tag names belonging to this node and/or its children.

Return type

list[str]

>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').get_all_tags()
    ['is_cheese']
has_tags()

Determines if this node has any tags at all.

Returns

True if the node has any tags; else, False.

Return type

bool

>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').has_tags()
    True
has_tag(tag, include_children=False)

Determine if this node has a tag with the specified name.

Parameters
  • tag (str) – The name of the tag.

  • include_children (bool) – should we include child nodes

Returns

True if the node has a tag by the specified name; else, False.

Return type

bool

>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').has_tag('is_cheese')
    True
    >>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').has_tag('is_fish')
    False
is_first_child()

Determines if this node is the first child of its parent or has no parent.

Returns

True if this node is the first child of its parent or if this node has no parent; else, False.

Return type

bool

is_last_child()

Determines if this node is the last child of its parent or has no parent.

Returns

True if this node is the last child of its parent or if this node has no parent; else, False.

Return type

bool

get_last_child_index()

Returns the max index value for the children of this node. If the node has no children, returns None.

Returns

The max index of the children of this node, or None if there are no children.

Return type

int or None

get_node_at_index(index)

Returns the child node at the specified index. If the specified index is below the first index (0) or above the last child’s index, None is returned.

Note: documents allow for sparse representation and child nodes may not have consecutive index numbers. If there isn’t a child node at the specified index, a ‘virtual’ node will be returned. This ‘virtual’ node will have the node type of its nearest sibling and will have an index value, but will have no features or content.

Parameters

index (int) – The index (zero-based) for the child node.

Returns

Node at index, or None if the index is outside the boundaries of child nodes.

Return type

ContentNode or None
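
The sparse-index behaviour described for get_node_at_index can be sketched with a dictionary keyed by index. SparseChild and node_at_index are hypothetical stand-ins, not kodexa code; in particular, picking the nearest sibling by absolute index distance is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SparseChild:
    """Hypothetical stand-in for a child node in a sparse document."""
    node_type: str
    index: int
    content: Optional[str] = None
    virtual: bool = False


def node_at_index(children: Dict[int, SparseChild], index: int) -> Optional[SparseChild]:
    if not children or index < 0 or index > max(children):
        # Outside the range of child indexes.
        return None
    if index in children:
        return children[index]
    # Gap in the indexes: build a virtual node typed like its nearest sibling.
    nearest = min(children, key=lambda i: abs(i - index))
    return SparseChild(children[nearest].node_type, index, virtual=True)


cells = {0: SparseChild("cell", 0, "A"), 3: SparseChild("cell", 3, "D")}
gap = node_at_index(cells, 1)
beyond = node_at_index(cells, 4)
```

Here `gap` is a virtual cell with no content, while `beyond` is None because index 4 lies past the last child.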

has_next_node(node_type_re='.*', skip_virtual=False)

Determine if this node has a next sibling that matches the type specified by the node_type_re regex.

Parameters
  • node_type_re (str, optional) – The regular expression to match against the next sibling node’s type; default is ‘.*’.

  • skip_virtual (bool, optional) – Skip virtual nodes and return the next real node; default is False.

Returns

True if there is a next sibling node matching the specified type regex; else, False.

Return type

bool

has_previous_node(node_type_re='.*', skip_virtual=False)

Determine if this node has a previous sibling that matches the type specified by the node_type_re regex.

Parameters
  • node_type_re (str, optional) – The regular expression to match against the previous sibling node’s type; default is ‘.*’.

  • skip_virtual (bool, optional) – Skip virtual nodes and return the next real node; default is False.

Returns

True if there is a previous sibling node matching the specified type regex; else, False.

Return type

bool

next_node(node_type_re='.*', skip_virtual=False, has_no_content=True)

Returns the next sibling content node.

Note: This logic relies on node indexes. Documents allow for sparse representation and child nodes may not have consecutive index numbers. Therefore, the next node might actually be a virtual node that is created to fill a gap in the document. You can skip virtual nodes by setting the skip_virtual parameter to True.

Parameters
  • node_type_re (str, optional) – The regular expression to match against the next sibling node’s type; default is ‘.*’.

  • skip_virtual (bool, optional) – Skip virtual nodes and return the next real node; default is False.

  • has_no_content (bool, optional) – Allow a node that has no content to be returned; default is True.

Returns

The next node or None, if no node exists

Return type

ContentNode or None
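
The interaction between sparse indexes, virtual nodes and skip_virtual can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the library's implementation: `Node`, `node_at_index` and `next_sibling` are hypothetical names, and the model assumes that any gap below the highest child index is filled by a virtual node with the placeholder type 'virtual'.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a ContentNode; not the kodexa class."""
    index: int
    node_type: str
    virtual: bool = False


def node_at_index(children: Dict[int, Node], index: int) -> Optional[Node]:
    """Return the real child at index, a virtual filler for a gap, or None past the end."""
    if index in children:
        return children[index]
    if children and index < max(children):
        # Assumption: gaps below the highest index are filled with virtual nodes.
        return Node(index=index, node_type="virtual", virtual=True)
    return None


def next_sibling(children: Dict[int, Node], current: Node,
                 node_type_re: str = ".*", skip_virtual: bool = False) -> Optional[Node]:
    """Walk forward by index until a matching sibling is found or the children run out."""
    pattern = re.compile(node_type_re)
    index = current.index + 1
    while True:
        candidate = node_at_index(children, index)
        if candidate is None:
            return None
        if pattern.match(candidate.node_type) and not (skip_virtual and candidate.virtual):
            return candidate
        index += 1


# Children exist at indexes 0 and 3 only, so indexes 1 and 2 are gaps.
children = {0: Node(0, "line"), 3: Node(3, "line")}
first = children[0]
with_virtual = next_sibling(children, first)
real_only = next_sibling(children, first, skip_virtual=True)
```

In this model `with_virtual` is the virtual filler at index 1, while `real_only` is the real node at index 3.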

previous_node(node_type_re='.*', skip_virtual=False, has_no_content=False, traverse=Traverse.SIBLING)

Returns the previous sibling content node.

Note: This logic relies on node indexes. Documents allow for sparse representation, so child nodes may not have consecutive index numbers. The previous node might therefore be a virtual node that is created to fill a gap in the document. You can skip virtual nodes by setting the skip_virtual parameter to True.

Parameters
  • node_type_re (str, optional) – The regular expression to match against the previous node’s type; default is ‘.*’.

  • skip_virtual (bool, optional) – Skip virtual nodes and return the previous real node; default is False.

  • has_no_content (bool, optional) – Allow a node that has no content to be returned; default is False.

  • traverse (Traverse(enum), optional) – The transition you’d like to traverse (SIBLING, CHILDREN, PARENT, or ALL); default is Traverse.SIBLING.

Returns

The previous node or None, if no node exists

Return type

ContentNode or None

class kodexa.Document(metadata=None, content_node: ContentNode = None, source=None, ref: str = None, kddb_path: str = None, delete_on_close=False)

Bases: object

A Document is a collection of metadata and a set of content nodes.

PREVIOUS_VERSION :str = 1.0.0
CURRENT_VERSION :str = 4.0.2
metadata :DocumentMetadata

Metadata relating to the document

_content_node :Optional[ContentNode]

The root content node

virtual :bool = False

Is the document virtual (deprecated)

_mixins :List[str] = []

A list of the mixins for this document

uuid :str

The UUID of this document

exceptions :List = []

A list of the exceptions on this document (deprecated)

log :List[str] = []

A log for this document (deprecated)

version

The version of the document

source :SourceMetadata

Source metadata for this document

labels :List[str] = []

A list of the document level labels for the document

taxonomies :List[str] = []

A list of the taxonomy references for this document

classes :List[ContentClassification] = []

A list of the content classifications associated at the document level

__str__()

Return str(self).

add_exception(exception: ContentException)
get_persistence()
get_all_tags()
get_tagged_nodes(tag_name, tag_uuid=None)
property content_node

The root content Node

add_classification(label: str, taxonomy_ref: Optional[str] = None) ContentClassification

Add a content classification to the document

Parameters
  • label (str) – the label

  • taxonomy_ref (Optional[str]) – the reference to the taxonomy

Returns

the content classification created (or the matching one if it is already on the document)

add_label(label: str)

Add a label to the document

Parameters
  • label (str) – Label to add

Returns

the document

remove_label(label: str)

Remove a label from the document

Parameters
  • label (str) – Label to remove

Returns

the document

classmethod from_text(text, separator=None)

Creates a new Document from the text provided.

Parameters
  • text (str) – Text to be used as content on the Document’s ContentNode(s)

  • separator (str) – If provided, this string will be used to split the text and the resulting text will be placed on children of the root ContentNode. (Default value = None)

Returns

the document
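
The effect of the separator can be pictured with a short stand-alone sketch. It does not use the kodexa classes: `split_text` is a hypothetical helper that only returns where the text would end up, assuming that the whole text lands on the root node when no separator is given and that each piece becomes the content of one child node otherwise.

```python
from typing import List, Optional, Tuple


def split_text(text: str, separator: Optional[str] = None) -> Tuple[Optional[str], List[str]]:
    """Return (root_content, child_contents) for the given text.

    Assumption: without a separator the whole text is the root's content;
    with one, the root has no content and each piece goes on a child.
    """
    if separator is None:
        return text, []
    return None, text.split(separator)


root_content, child_contents = split_text("one\ntwo\nthree", separator="\n")
whole_content, no_children = split_text("one\ntwo\nthree")
```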

get_root()

Get the root content node for the document (same as content_node)

to_kdxa(file_path: str)

Write the document to the kdxa format (msgpack) which can be used with the Kodexa platform

Parameters
  • file_path (str) – the path to the .kdxa file you wish to create

>>> document.to_kdxa('my-document.kdxa')
static open_kddb(file_path)

Opens a Kodexa Document Database.

This is the Kodexa V4 default way to store documents. It provides high performance and the ability to handle very large document objects.

Parameters

file_path – The file path

Returns

The Document instance

close()

Close the document and clean up the resources

to_kddb(path=None)

Either write this document to a KDDB file or convert this document object structure into a KDDB and return a bytes-like object

This is dependent on whether you provide a path to write to

static from_kdxa(file_path)

Read a .kdxa file from the given file_path and return it as a Document.

Parameters

file_path – the path to the .kdxa file

>>> document = Document.from_kdxa('my-document.kdxa')
to_msgpack()

Convert this document object structure into a message pack

to_json()

Create a JSON string representation of this Document.

Returns

The JSON formatted string representation of this Document.

Return type

str

>>> document.to_json()
to_dict()

Create a dictionary representing this Document’s structure and content.

Returns

A dictionary representation of this Document.

Return type

dict

>>> document.to_dict()
static from_dict(doc_dict)

Build a new Document from a dictionary.

Parameters
  • doc_dict (dict) – A dictionary representation of a Kodexa Document.

Returns

A complete Kodexa Document

Return type

Document

>>> Document.from_dict(doc_dict)
static from_json(json_string)

Create an instance of a Document from a JSON string.

Parameters
  • json_string (str) – A JSON string representation of a Kodexa Document

Returns

A complete Kodexa Document

Return type

Document

>>> Document.from_json(json_string)
static from_msgpack(msgpack_bytes)

Create an instance of a Document from a message pack byte array.

Parameters

msgpack_bytes (bytes) – A message pack byte array.

Returns

A complete Kodexa Document

Return type

Document

>>> Document.from_msgpack(open('news-doc.kdxa', 'rb').read())
get_mixins()

Get the list of mixins that have been enabled on this document

Returns

A list of the mixin names

Return type

list[str]

add_mixin(mixin)

Add the given mixin to this document. This applies the mixin to all existing content nodes and registers it with the document, so that nodes created later with create_node also have the mixin applied.

Parameters

mixin (str) – the name of the mixin to add

>>> from kodexa import *
>>> document = Document()
>>> document.add_mixin('spatial')

create_node(node_type: str, content: Optional[str] = None, virtual: bool = False, parent: ContentNode = None, index: Optional[int] = None)

Creates a new node for the document. The new node is not added to the document, but any mixins that have been applied to the document will also be available on the new node.

Parameters
  • node_type (str) – The type of node.

  • content (str) – The content for the node; defaults to None.

  • virtual (bool) – Indicates if this is a ‘real’ or ‘virtual’ node; default is False. ‘Real’ nodes contain document content. ‘Virtual’ nodes are synthesized as necessary to fill gaps in between non-consecutively indexed siblings. Such indexing arises when document content is sparse.

  • parent (ContentNode) – The parent for this newly created node; default is None.

  • index (Optional[int]) – The index property to be set on this node; default is 0.

Returns

This newly created node.

Return type

ContentNode

>>> document.create_node(node_type='page')
    <kodexa.model.model.ContentNode object at 0x7f80605e53c8>
classmethod from_kddb(source, detached: bool = False)

Loads a document from a Kodexa Document Database (KDDB) file

Parameters
  • source – if a string, the file at that path is loaded; if bytes, a temporary file is created and the KDDB is loaded into it

  • detached (bool) – if reading from a file we will create a copy so we don’t update in place

Returns

the document

classmethod from_file(file, unpack: bool = False)

Creates a Document that has a ‘file-handle’ connector to the specified file.

Parameters
  • file – The file to which the new Document is connected.

  • unpack (bool) – (Default value = False)

Returns

A Document connected to the specified file.

Return type

Document

classmethod from_url(url, headers=None)

Creates a Document that has a ‘url’ connector for the specified url.

Parameters
  • url (str) – The URL to which the new Document is connected.

  • headers (dict) – Headers that should be used when reading from the URL (Default value = None)

Returns

A Document connected to the specified URL with the specified headers (if any).

Return type

Document

select_first(selector, variables=None) Optional[ContentNode]

Select and return the first child of this node that matches the selector value.

Parameters
  • selector (str) – The selector (e.g. //*)

  • variables (dict, optional) – A dictionary of variable name/value pairs to use in substitution; defaults to None. Dictionary keys should match a variable specified in the selector.

Returns

The first matching node or none

Return type

Optional[ContentNode]

>>> document.get_root().select_first('.')
   ContentNode
>>> document.get_root().select_first('//*[hasTag($tagName)]', {"tagName": "div"})
   ContentNode
select(selector: str, variables: Optional[dict] = None) List[ContentNode]

Execute a selector on the root node and then return a list of the matching nodes.

Parameters
  • selector (str) – The selector (e.g. //*)

  • variables (Optional[dict]) – A dictionary of variable name/value pairs to use in substitution; defaults to an empty dictionary. Dictionary keys should match a variable specified in the selector.

Returns

A list of the matching ContentNodes. If no matches found, list is empty.

Return type

list[ContentNode]

>>> document.select('.')
   [ContentNode]
get_labels() List[str]

Get the labels associated with this document

Returns

list of associated labels

Return type

List[str]

class kodexa.DocumentActor

Bases: kodexa.model.base.KodexaBaseModel

Provides the definition of an actor in a transition

actor_id :Optional[str]
actor_type :Optional[ActorType]
class kodexa.DocumentMetadata(*args, **kwargs)

Bases: addict.Dict

A flexible dict based approach to capturing metadata for the document

class kodexa.DocumentTransition

Bases: kodexa.model.base.KodexaBaseModel

Provides the definition of a transition for a document, where a change was applied by an assistant, user or external process

id :Optional[str]
uuid :Optional[str]
created_on :Optional[datetime.datetime]
updated_on :Optional[datetime.datetime]
unknown_fields :Optional[Dict[str, str]]
transition_type :Optional[TransitionType]
index :Optional[int]
date_time :Optional[datetime.datetime]
actor :Optional[DocumentActor]
label :Optional[str]
destination_content_object_id :Optional[str]
source_content_object_id :Optional[str]
class kodexa.SourceMetadata

Class for keeping track of the original source information for a document

original_filename :Optional[str]
original_path :Optional[str]
checksum :Optional[str]
cid :Optional[str]
last_modified :Optional[str]
created :Optional[str]
connector :Optional[str]
mime_type :Optional[str]
headers :Optional[addict.Dict]
lineage_document_uuid :Optional[str]
source_document_uuid :Optional[str]
pdf_document_uuid :Optional[str]
classmethod from_dict(env)
Parameters

env

Returns:

class kodexa.TransitionType

Bases: enum.Enum

The type of transition

derived = DERIVED
class kodexa.Taxonomy

Bases: ExtensionPackProvided

Provides the taxonomy hierarchy that is used for content and document classification and labeling

type :Optional[str]
taxonomy_type :Optional[TaxonomyType1]
enabled :Optional[bool]
taxons :Optional[List[Taxon]]
overlays :Optional[List[Overlay]]
total_taxons :Optional[int]
class kodexa.Pipeline(connector=None, name: str = 'Default', stop_on_exception: bool = True, logging_level=logger.info, apply_lineage: bool = True)

A pipeline represents a way to bring together parts of the kodexa framework to solve a specific problem.

When you create a Pipeline you must provide the connector that will be used to source the documents.

Parameters
  • connector – the connector that will be the starting point for the pipeline

  • name – the name of the pipeline (default ‘Default’)

  • stop_on_exception – Should the pipeline raise exceptions and stop (default True)

  • logging_level – The logging level of the pipeline (default INFO)

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
context :PipelineContext
add_label(label: str, options=None, attach_source=False)

Adds a label to the document

Parameters
  • label (str) – label to add

  • options – options to be passed to the step if it is a simplified remote action (Default value = None)

  • attach_source – if step is simplified remote action this determines if we need to add the source (Default value = False)

Returns

the pipeline

remove_label(label: str, options=None, attach_source=False)

Removes a label from the document

Parameters
  • label (str) – label to remove

  • options – options to be passed to the step if it is a simplified remote action (Default value = None)

  • attach_source – if step is simplified remote action this determines if we need to add the source (Default value = False)

Returns

the pipeline

add_step(step, name=None, options=None, attach_source=False, step_type='ACTION')

Add the given step to the current pipeline

Note that it is also possible to add a function as a step; the function receives the document and returns it.

If you are using remote actions on a server, or deploying to a remote pipeline, you can also refer to the step by its reference string as a shorthand.

Parameters
  • step – the step to add

  • name – the name to use to describe the step (default None)

  • options – options to be passed to the step if it is a simplified remote action (Default value = None)

  • attach_source – if step is simplified remote action this determines if we need to add the source (Default value = False)

  • step_type – the type of step to add, can either be an ACTION or MODEL

Returns

the instance of the pipeline

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
>>> pipeline.add_step(ExampleStep())
>>> def my_function(doc):
...     doc.metadata.fishstick = 'foo'
...     return doc
>>> pipeline.add_step(my_function)
>>> pipeline.add_step('kodexa/html-parser', options={'summarize': False})
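
How function steps chain together can be shown without the library. The following is a minimal sketch, assuming each function step takes the document and returns it for the next step; `MiniPipeline` and the `SimpleNamespace` document are hypothetical stand-ins, not kodexa.Pipeline or kodexa.Document.

```python
from types import SimpleNamespace
from typing import Callable, List


class MiniPipeline:
    """Hypothetical stand-in that only models function steps; not kodexa.Pipeline."""

    def __init__(self) -> None:
        self.steps: List[Callable] = []

    def add_step(self, step: Callable) -> "MiniPipeline":
        # Returning self mirrors the documented "returns the instance of the pipeline".
        self.steps.append(step)
        return self

    def run(self, document):
        # Each step receives the document and hands its return value to the next step.
        for step in self.steps:
            document = step(document)
        return document


def set_fishstick(doc):
    doc.metadata.fishstick = "foo"
    return doc


def mark_reviewed(doc):
    doc.labels.append("reviewed")
    return doc


document = SimpleNamespace(metadata=SimpleNamespace(), labels=[])
result = MiniPipeline().add_step(set_fishstick).add_step(mark_reviewed).run(document)
```

Because add_step returns the pipeline, calls can be chained as in the last line.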
to_yaml()

Will return the YAML representation of any actions that support conversion to YAML

The YAML representation for RemoteAction’s can be used for metadata only pipelines in the Kodexa Platform

Returns

YAML representation

run(parameters=None)

Run the current pipeline

Parameters

parameters – (Default value = None)

Returns

The context from the run

>>> pipeline = Pipeline(FolderConnector(path='/tmp/', file_filter='example.pdf'))
>>> pipeline.run()
static from_url(url, headers=None, *args, **kwargs)

Build a new pipeline with the input being a document created from the given URL

Parameters
  • url – The URL, e.g. https://www.google.com

  • headers – A dictionary of headers (Default value = None)

  • *args

  • **kwargs

Returns

A new instance of a pipeline

static from_file(file_path: str, *args, **kwargs) Pipeline

Create a new pipeline using a file path as a source

Parameters
  • file_path (str) – The path to the file

  • *args

  • **kwargs

Returns

A new pipeline

Return type

Pipeline

static from_text(text: str, *args, **kwargs) Pipeline

Build a new pipeline and provide text as the basis to create a document

Parameters
  • text (str) – Text to use to create document

  • *args

  • **kwargs

Returns

A new pipeline

Return type

Pipeline

static from_folder(folder_path: str, filename_filter: str = '*', recursive: bool = False, relative: bool = False, unpack=False, caller_path: str = get_caller_dir(), *args, **kwargs) Pipeline

Create a pipeline that will run against a set of local files from a folder

Parameters
  • folder_path (str) – The folder path

  • filename_filter (str) – The filter for filename, e.g. *.pdf (Default value = “*”)

  • recursive (bool) – Should we look recursively in sub-directories (default False)

  • relative (bool) – Is the folder path relative to the caller (default False)

  • caller_path (str) – The caller path (defaults to trying to work this out from the stack)

  • unpack – Treat the files in the folder as KDXA documents and unpack them using from_kdxa (default False)

  • *args

  • **kwargs

Returns

A new pipeline

Return type

Pipeline
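
The combination of filename_filter and recursive can be approximated with the standard library. This is only a sketch of the selection behaviour, assuming the filter is a glob pattern applied to file names; `list_matching_files` is a hypothetical helper and says nothing about how the FolderConnector is actually implemented.

```python
import tempfile
from pathlib import Path
from typing import List


def list_matching_files(folder_path: str, filename_filter: str = "*",
                        recursive: bool = False) -> List[str]:
    """Return sorted paths of files under folder_path whose name matches the glob filter."""
    folder = Path(folder_path)
    # rglob descends into sub-directories; glob only looks at the top level.
    candidates = folder.rglob(filename_filter) if recursive else folder.glob(filename_filter)
    return sorted(str(path) for path in candidates if path.is_file())


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.pdf").write_text("x")
    (Path(tmp) / "notes.txt").write_text("x")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.pdf").write_text("x")

    top_level = [Path(p).name for p in list_matching_files(tmp, "*.pdf")]
    everything = [Path(p).name for p in list_matching_files(tmp, "*.pdf", recursive=True)]
```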

class kodexa.PipelineContext(content_provider=None, existing_content_objects=None, context=None, execution_id=None, status_handler=None, cancellation_handler=None)

Pipeline context is created when you create a pipeline and it provides a way to access information about the pipeline that is running. It can be made available to steps/functions so they can interact with it.

It also provides access to the ‘stores’ that have been added to the pipeline

update_status(status_message: str, status_full_message: Optional[str] = None)
is_cancelled() bool
get_context() Dict
get_content_objects() List[kodexa.model.ContentObject]
get_content(content_object: kodexa.model.ContentObject)
Parameters

content_object – ContentObject:

Returns:

put_content(content_object: kodexa.model.ContentObject, content)
Parameters
  • content_object – ContentObject:

  • content

Returns:

set_current_document(current_document: kodexa.model.Document)

Set the Document that is currently being processed in the pipeline

Parameters
  • current_document (Document) – The current document

get_current_document() kodexa.model.Document

Get the current document that is being processed in the pipeline

Returns

The current document, or None

set_output_document(output_document: kodexa.model.Document)

Set the output document from the pipeline

Parameters
  • output_document (Document) – the final output document from the pipeline

Returns

the final output document

class kodexa.PipelineStatistics

A set of statistics for the processed document

documents_processed
document_exceptions

processed_document(document)

Update statistics based on this document completing processing

Parameters

document – the document that has been processed

Returns:

class kodexa.KodexaPlatform

The KodexaPlatform object lets you work with an instance of the Kodexa platform: you can list, view and deploy components

Note it also can be used to get your access token and Kodexa platform URL using:

  • A user config file if available

  • Environment variables (KODEXA_ACCESS_TOKEN and KODEXA_URL)
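
The two sources listed above could be combined along the following lines. This is a sketch, not the client's implementation: `resolve_setting` and the config keys are hypothetical, and the precedence (environment variable first, then user config) is an assumption that the text above does not confirm.

```python
import os
from typing import Mapping, Optional


def resolve_setting(env_name: str, config_key: str,
                    user_config: Optional[Mapping[str, str]] = None,
                    environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Look up a setting in the environment, then in a user config mapping.

    Assumption: the environment variable wins when both are present.
    """
    value = environ.get(env_name)
    if value is not None:
        return value
    if user_config is not None:
        return user_config.get(config_key)
    return None


# Example inputs; the values are placeholders.
fake_env = {"KODEXA_URL": "https://my-company.kodexa.ai"}
fake_config = {"url": "https://config.example", "access_token": "token-from-config"}

url = resolve_setting("KODEXA_URL", "url", fake_config, fake_env)
token = resolve_setting("KODEXA_ACCESS_TOKEN", "access_token", fake_config, fake_env)
```

Passing the environment as a mapping keeps the sketch testable without touching the real process environment.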

static get_client()
static get_access_token() str

Returns the access token

>>> access_token = KodexaPlatform.get_access_token()

Returns: The access token if it is defined in the user config store, or as an environment variable

static get_url() str

Returns the URL to use to access a Kodexa Platform

The URL should be in the form https://my-company.kodexa.ai

>>> url = KodexaPlatform.get_url()

Returns: The URL if it is defined in the user config store, or as an environment variable

static set_access_token(access_token: str)

Set to override the access token to use; note that this does not change the value stored in your user config

Parameters

access_token – str: The new access token

Returns: None

static set_url(url: str)

Set to override the URL to use; note that this does not change the value stored in your user config

Parameters

url – str: The new URL

Returns: None

static get_access_token_details() addict.Dict

Pull the access token details (including a list of the available organizations)

Returns: Dict: details of the access token

static resolve_ref(ref: str)
classmethod login(kodexa_url, username, password)
classmethod get_server_info()
classmethod get_tempdir()
class kodexa.RemoteStep(ref, step_type='ACTION', attach_source=False, options=None)

Allows you to interact with a step that has been deployed in the Kodexa platform

to_dict()
get_name()
process(document, context)
to_configuration()

Returns a dictionary representing the configuration information for the step

Returns

dictionary representing the configuration of the step

class kodexa.RemotePipeline(slug, connector, version=None, attach_source=True, parameters=None, auth=None)

Allow you to interact with a pipeline that has been deployed to an instance of Kodexa Platform

run()
static from_url(slug: str, url, headers=None, *args, **kwargs) RemotePipeline

Build a new pipeline with the input being a document created from the given URL

Parameters
  • slug (str) – The slug for the remote pipeline

  • url – The URL, e.g. https://www.google.com

  • headers – A dictionary of headers (Default value = None)

  • *args

  • **kwargs

Returns

A new instance of a remote pipeline

static from_file(slug: str, file_path: str, unpack: bool = False, *args, **kwargs) RemotePipeline

Create a new pipeline using a file path as a source

Parameters
  • slug (str) – The slug for the remote pipeline

  • file_path (str) – The path to the file

  • unpack (bool) – Unpack the file as a KDXA (Default value = False)

  • *args

  • **kwargs

Returns

A new pipeline

Return type

RemotePipeline

static from_text(slug: str, text: str, *args, **kwargs) RemotePipeline

Build a new pipeline and provide text as the basis to create a document

Parameters
  • slug (str) – The slug for the remote pipeline

  • text (str) – Text to use to create document

  • *args

  • **kwargs

Returns

A new pipeline

Return type

RemotePipeline

static from_folder(slug: str, folder_path: str, filename_filter: str = '*', recursive: bool = False, unpack: bool = False, relative: bool = False, caller_path: str = get_caller_dir()) RemotePipeline

Create a pipeline that will run against a set of local files from a folder

Parameters
  • slug (str) – The slug for the remote pipeline

  • folder_path (str) – The folder path

  • filename_filter (str) – The filter for filename, e.g. *.pdf (Default value = “*”)

  • recursive (bool) – Should we look recursively in sub-directories (default False)

  • relative (bool) – Is the folder path relative to the caller (default False)

  • caller_path (str) – The caller path (defaults to trying to work this out from the stack)

  • unpack (bool) – Unpack the file as a KDXA document (Default value = False)

Returns

A new pipeline

Return type

RemotePipeline

class kodexa.RemoteSession(session_type, slug)

A Session on the Kodexa platform for leveraging pipelines and services

get_action_metadata(ref)
Parameters

ref

Returns:

start()
execution_action(document, options, attach_source, context)
wait_for_execution(execution)
get_output_document(execution)

Get the output document from a given execution

Parameters

execution – the execution holding the document

Returns

the output document (or None if there isn’t one)

class kodexa.KodexaClient(url=None, access_token=None)
static login(url, email, password)
property me
property platform kodexa.model.objects.PlatformOverview
change_password(old_password: str, new_password: str)
reindex()
__build_object(ref, object_type_metadata)
get_object_by_ref(object_type: str, ref: str) pydantic.BaseModel
get_object_endpoint(object_type: str) pydantic.BaseModel
get_platform()
exists(url, params=None) bool
get(url, params=None) requests.Response
post(url, body=None, data=None, files=None, params=None) requests.Response
put(url, body=None, data=None, files=None, params=None) requests.Response
delete(url, params=None) requests.Response
get_url(url)
export_project(project: ProjectEndpoint, export_path: str)
import_project(organization: OrganizationEndpoint, import_path: str)
deserialize(component_dict: dict, component_type: Optional[str] = None) ComponentInstanceEndpoint
get_project(project_id) ProjectEndpoint
get_object_type(object_type, organization: Optional[OrganizationEndpoint] = None) ClientEndpoint
class kodexa.NodeTagCopy(selector, existing_tag_name, new_tag_name)

The NodeTagCopy action allows you to select the nodes specified by the selector and create copies of the tag named existing_tag_name (if it exists) under new_tag_name. If a tag with the ‘existing_tag_name’ does not exist on a selected node, no action is taken for that node.

selector

The selector to match the nodes

existing_tag_name

The existing tag name that will be the source

new_tag_name

The new tag name that will be the destination

process(document)
class kodexa.NodeTagger(selector, tag_to_apply, content_re='.*', use_all_content=True, node_only=False, node_tag_uuid=None)

A node tagger allows you to provide a type and content regular expression and then tag content in all matching nodes.

It allows multiple matching groups to be defined. It can also tag all of the content, or tag only the node (ignoring the matching groups).

selector

The selector to use to find the node(s) to tag

content_re

A regular expression used to match the content in the identified nodes

use_all_content

A flag that will assume that all content should be tagged (there will be no start/end)

tag_to_apply

The tag to apply to the node(s)

node_only

Tag the node only and no content

node_tag_uuid

The UUID to use on the tag

process(document)
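
The relationship between content_re, its matching groups and use_all_content can be sketched with the standard re module. This is an illustration only: `tag_spans` is a hypothetical helper, and it assumes that with use_all_content=False each non-empty matching group yields one tagged span, while a pattern without groups tags the whole match.

```python
import re
from typing import List, Optional, Tuple


def tag_spans(content: str, content_re: str = ".*",
              use_all_content: bool = True) -> Optional[List[Tuple[int, int]]]:
    """Return the (start, end) offsets to tag, or None if the content does not match.

    An empty list means "tag all of the content, with no start/end offsets".
    """
    match = re.match(content_re, content)
    if match is None:
        return None
    if use_all_content:
        return []
    if match.groups():
        # Assumption: one span per non-empty matching group.
        return [match.span(i) for i in range(1, len(match.groups()) + 1)
                if match.group(i)]
    return [match.span()]


spans = tag_spans("Invoice 1234 dated 2021-05-01",
                  r"Invoice (\d+) dated (\S+)", use_all_content=False)
```

Here `spans` holds one offset pair for the invoice number and one for the date.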
class kodexa.RollupTransformer(collapse_type_res=None, reindex: bool = True, selector: str = '.', separator_character: str = None, get_all_content: bool = False)

The rollup step allows you to decide how you want to collapse content in a document by removing nodes while maintaining content and features as needed

process(document)
is_node_in_list(node, node_ids)
Parameters
  • node

  • node_ids

Returns:

class kodexa.TextParser(encoding='utf-8', lines_as_child_nodes=False)

Parser to load a source file as a text document. The text from the document may be placed on the root ContentNode or on the root’s child nodes (controlled by lines_as_child_nodes).

encoding

The encoding that should be used when attempting to decode data (default ‘utf-8’)

lines_as_child_nodes

If True, the lines of the file will be set as children of the root ContentNode; otherwise, the entire file content is set on the root ContentNode. (default False)

decode_text(data)
process(document)
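
The two options can be pictured with a stand-alone sketch. `parse_text` is a hypothetical helper, not the parser itself; it assumes that with lines_as_child_nodes=True each line (without its line ending) becomes one child and the root carries no content.

```python
from typing import List, Optional, Tuple


def parse_text(data: bytes, encoding: str = "utf-8",
               lines_as_child_nodes: bool = False) -> Tuple[Optional[str], List[str]]:
    """Decode raw bytes and return (root_content, child_line_contents)."""
    text = data.decode(encoding)
    if lines_as_child_nodes:
        # Assumption: one child per line, with line endings removed.
        return None, text.splitlines()
    return text, []


raw = "première ligne\nsecond line\n".encode("utf-8")
whole, no_children = parse_text(raw)
empty_root, lines = parse_text(raw, lines_as_child_nodes=True)
```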
exception kodexa.KodexaProcessingException(message, description, advice=None, documentation_url=None)

Bases: Exception

This is a specialized exception. If it is thrown while running in the Kodexa Platform, the additional exception details are included so that they can be presented back to the user

description

The description of the problem; this is a longer description

advice

Any advice on how to handle the problem, this can also include markdown to help present possible solutions

message

A short message to express the problem

documentation_url

A link to a URL where the user might find more information on the problem

__str__()

Return str(self).
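
How the four fields might be used when raising such an exception is sketched below. `ProcessingError` is a hypothetical stand-in that mirrors the documented constructor arguments so the example runs without the library; the `__str__` format and the `check_mime_type` step are assumptions made for illustration.

```python
from typing import Optional


class ProcessingError(Exception):
    """Hypothetical stand-in mirroring the documented constructor arguments."""

    def __init__(self, message: str, description: str,
                 advice: Optional[str] = None,
                 documentation_url: Optional[str] = None) -> None:
        super().__init__(message)
        self.message = message
        self.description = description
        self.advice = advice
        self.documentation_url = documentation_url

    def __str__(self) -> str:
        # Assumption: the short message first, then the longer description.
        return f"{self.message}: {self.description}"


def check_mime_type(mime_type: str) -> str:
    """Raise a descriptive error for anything that is not a PDF."""
    if mime_type != "application/pdf":
        raise ProcessingError(
            message="Unsupported file type",
            description=f"Expected a PDF but received '{mime_type}'.",
            advice="Convert the file to **PDF** and try again.",
        )
    return mime_type


try:
    check_mime_type("text/plain")
except ProcessingError as error:
    caught = error
```

The advice string uses markdown, which the description of the advice field above says is allowed.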