kodexa.model
Model represents the core model at the heart of the Kodexa Content Model and architecture.
It allows you to define:
Documents
Pipelines
Steps
and much more.
Document families allow the organization of documents based on transitions and actors
Submodules
Package Contents
Classes
A feature allows you to capture almost any additional data or metadata and associate it with a ContentNode
A Content Node identifies a section of the document containing a logical grouping of information.
A Document is a collection of metadata and a set of content nodes.
A flexible dict based approach to capturing metadata for the document
Class for keeping track of the original source information for a document
A content exception represents an issue identified during labeling or validation at the document level
The type of content
The type of transition
Provides the definition of an actor in a transition
Provides the definition of a transition for a document, where a change was applied by an assistant, user or external process
The type of actor
Generic enumeration.
Generic enumeration.
Provides the taxonomy hierarchy that is used for content and document classification and labeling
The Sqlite persistence engine to support large scale documents (part of the V4 Kodexa Document Architecture)
The persistence manager supports holding the document and only flushing objects to the persistence layer
- class kodexa.model.ContentFeature(feature_type: str, name: str, value: Any, single: bool = True)
Bases: object
A feature allows you to capture almost any additional data or metadata and associate it with a ContentNode
- feature_type :str
The type of feature, a logical name to group feature types together (ie. spatial)
- name :str
The name of the feature (ie. bbox)
- value :Any
The value of the feature
- single :bool
Determines whether the data for this feature is a single instance or an array. If you add the same feature to the same node more than once, the content feature will hold multiple data elements and the single flag will be False.
- __str__()
Return str(self).
- to_dict()
Create a dictionary representing this ContentFeature’s structure and content.
- Returns
The properties of this ContentFeature structured as a dictionary.
- Return type
dict
>>> node.to_dict()
- get_value()
Get the value from the feature. This method will handle the single flag
- Returns
The value of the feature
- class kodexa.model.ContentNode(document, node_type: str, content: Optional[str] = None, content_parts: Optional[List[Any]] = None, parent=None, index: Optional[int] = None, virtual: bool = False)
Bases: object
A Content Node identifies a section of the document containing a logical grouping of information.
The node will have content and can include any number of features.
You should always create a node using the Document’s create_node method to ensure that the correct mixins are applied.
>>> new_page = document.create_node(node_type='page')
>>> current_content_node.add_child(new_page)
>>> new_page = document.create_node(node_type='page', content='This is page 1')
>>> current_content_node.add_child(new_page)
- node_type :str
The node type (ie. line, page, cell etc)
- document :Document
The document that the node belongs to
- _content_parts :Optional[List[Any]]
The children of the content node
- index :Optional[int]
The index of the content node
- uuid :Optional[int]
The ID of the content node
- virtual :bool
Is the node virtual (ie. it doesn’t actually exist in the document)
- get_content_parts()
- set_content_parts(content_parts)
- property content
- __eq__(other)
Return self==value.
- get_parent()
- __str__()
Return str(self).
- to_json()
Create a JSON string representation of this ContentNode.
- Returns
The JSON formatted string representation of this ContentNode.
- Return type
str
>>> node.to_json()
- to_dict()
Create a dictionary representing this ContentNode’s structure and content.
- Returns
The properties of this ContentNode and all of its children structured as a dictionary.
- Return type
dict
>>> node.to_dict()
- static from_dict(document, content_node_dict: addict.Dict, parent=None)
Build a new ContentNode from a dictionary representation.
- Parameters
document (Document) – The Kodexa document from which the new ContentNode will be created (not added).
content_node_dict (Dict) – The dictionary-structured representation of a ContentNode. This value will be unpacked into a ContentNode.
parent (Optional[ContentNode]) – Optionally the parent content node
- Returns
A ContentNode containing the unpacked values from the content_node_dict parameter.
- Return type
ContentNode
>>> ContentNode.from_dict(document, content_node_dict)
- add_child_content(node_type: str, content: str, index: Optional[int] = None) ContentNode
Convenience method that lets you quickly add a child node with a type and content
- Parameters
node_type – the node type
content – the content
index – the index (optional) (Default value = None)
- Returns
the new ContentNode
- add_child(child, index: Optional[int] = None)
Add a ContentNode as a child of this ContentNode
- Parameters
child (ContentNode) – The node that will be added as a child of this node
index (Optional[int]) – The index at which this child node should be added; defaults to None. If None, index is set as the count of child node elements.
>>> new_page = document.create_node(node_type='page')
>>> current_content_node.add_child(new_page)
- remove_child(content_node)
- get_children()
Returns a list of the children of this node.
- Returns
The list of child nodes for this ContentNode.
- Return type
list[ContentNode]
>>> node.get_children()
- set_feature(feature_type, name, value)
Sets a feature for this ContentNode, replacing the value if a feature by this type and name already exists.
- Parameters
feature_type (str) – The type of the feature.
name (str) – The name of the feature.
value (Any) – The value of the feature.
- Returns
The feature that was added to this ContentNode
- Return type
ContentFeature
>>> new_page = document.create_node(node_type='page')
>>> new_page.set_feature('pagination','pageNum',1)
- add_feature(feature_type, name, value, single=True, serialized=False)
Add a new feature to this ContentNode.
Note: if a feature for this feature_type/name already exists, the new value will be added to the existing feature; therefore the feature value might become a list.
- Parameters
feature_type (str) – The type of feature to be added to the node.
name (str) – The name of the feature.
value (Any) – The value of the feature.
single (boolean) – Indicates that the value is singular, rather than a collection (ex: str vs list); defaults to True.
serialized (boolean) – Indicates that the value is/is not already serialized; defaults to False.
- Returns
The feature that was added to this ContentNode.
- Return type
ContentFeature
>>> new_page = document.create_node(node_type='page')
>>> new_page.add_feature('pagination','pageNum',1)
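The difference between adding and setting a feature can be sketched with a small stand-in. This is a minimal illustration and not the Kodexa implementation: the `FeatureStore` class and its internal list storage are assumptions made for the example, and only the accumulate-versus-replace behaviour comes from the descriptions above.

```python
from typing import Any, Dict, List, Tuple


class FeatureStore:
    """Hypothetical stand-in for the features held by a node (not Kodexa code)."""

    def __init__(self) -> None:
        # Assumption: values are kept in a list per (feature_type, name) key.
        self._values: Dict[Tuple[str, str], List[Any]] = {}

    def add_feature(self, feature_type: str, name: str, value: Any) -> None:
        # Adding appends, so a repeated key accumulates several values.
        self._values.setdefault((feature_type, name), []).append(value)

    def set_feature(self, feature_type: str, name: str, value: Any) -> None:
        # Setting replaces whatever was stored for the key.
        self._values[(feature_type, name)] = [value]

    def is_single(self, feature_type: str, name: str) -> bool:
        return len(self._values[(feature_type, name)]) == 1

    def get_value(self, feature_type: str, name: str) -> Any:
        values = self._values[(feature_type, name)]
        # Unwrap when there is exactly one value, otherwise return the list.
        return values[0] if len(values) == 1 else list(values)


store = FeatureStore()
store.add_feature("pagination", "pageNum", 1)
single_before = store.is_single("pagination", "pageNum")
store.add_feature("pagination", "pageNum", 2)
accumulated = store.get_value("pagination", "pageNum")
store.set_feature("pagination", "pageNum", 3)
replaced = store.get_value("pagination", "pageNum")
```

Here `accumulated` holds both values after the second add, while `replaced` is a single value again after the set.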
- delete_children(nodes: Optional[List] = None, exclude_nodes: Optional[List] = None)
Delete the children of this node. You can supply either a list of the nodes to delete or a list of the nodes to exclude from the delete; if neither is supplied, all the children are deleted.
Note the precedence: if you provide a list of nodes to delete, the list of nodes to exclude is ignored.
- Parameters
nodes (Optional[List[ContentNode]]) – A list of child content nodes to delete (Default value = None)
exclude_nodes (Optional[List[ContentNode]]) – A list of child content nodes that should not be deleted (Default value = None)
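The precedence rule can be sketched as a plain function over a list of child names. This is an illustration under assumptions, not Kodexa code: `children_to_keep` is a hypothetical helper, and treating "provided" as "not None" is a guess.

```python
from typing import List, Optional


def children_to_keep(children: List[str],
                     nodes: Optional[List[str]] = None,
                     exclude_nodes: Optional[List[str]] = None) -> List[str]:
    """Return the children that survive the delete (hypothetical helper)."""
    if nodes is not None:
        # An explicit delete list wins; exclude_nodes is ignored.
        return [child for child in children if child not in nodes]
    if exclude_nodes is not None:
        # Delete everything except the excluded children.
        return [child for child in children if child in exclude_nodes]
    # Neither list supplied: every child is deleted.
    return []


kept_when_both_given = children_to_keep(["a", "b", "c"], nodes=["a"], exclude_nodes=["a"])
kept_when_excluding = children_to_keep(["a", "b", "c"], exclude_nodes=["b"])
kept_by_default = children_to_keep(["a", "b", "c"])
```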
- get_feature(feature_type, name)
Gets the feature with the given type and name.
- Parameters
feature_type (str) – The type of the feature.
name (str) – The name of the feature.
- Returns
The feature with the specified type and name. If no feature is found, None is returned. Note that if there is more than one instance of the feature, you will only get the first one.
- Return type
ContentFeature or None
>>> new_page.get_feature('pagination','pageNum')
- get_features_of_type(feature_type)
Get all features of a specific type.
- Parameters
feature_type (str) – The type of the feature.
- Returns
A list of feature with the specified type. If no features are found, an empty list is returned.
- Return type
list[ContentFeature]
>>> new_page.get_features_of_type('my_type')
[]
- has_feature(feature_type: str, name: str)
Determines if a feature with the given type and name exists on this content node.
- Parameters
feature_type (str) – The type of the feature.
name (str) – The name of the feature.
- Returns
True if the feature is present; else, False.
- Return type
bool
>>> new_page.has_feature('pagination','pageNum')
True
- get_features()
Get all features on this ContentNode.
- Returns
A list of the features on this ContentNode.
- Return type
list[ContentFeature]
- remove_feature(feature_type: str, name: str, include_children: bool = False)
Removes the feature with the given name and type from this node.
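- Parameters
feature_type (str) – The type of the feature.
name (str) – The name of the feature.
include_children (bool) – Also remove the feature from the children of this node (Default value = False)
>>> new_page.remove_feature('pagination','pageNum')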
- get_feature_value(feature_type: str, name: str) Optional[Any]
Get the value for a feature with the given name and type on this ContentNode.
- Parameters
feature_type (str) – The type of the feature.
name (str) – The name of the feature.
- Returns
The value of the feature if it exists on this ContentNode; otherwise, None. Note that this only returns the first value (check single to determine if there are multiple).
- Return type
Any or None
>>> new_page.get_feature_value('pagination','pageNum')
1
- get_feature_values(feature_type: str, name: str) Optional[List[Any]]
Get the values for a feature with the given name and type on this ContentNode.
- Parameters
feature_type (str) – The type of the feature.
name (str) – The name of the feature.
- Returns
The list of feature values or None if there is no feature
>>> new_page.get_feature_values('pagination','pageNum')
[1]
- get_content()
Get the content of this node.
- Returns
The content of this ContentNode.
- Return type
str
>>> new_page.get_content()
"This is page one"
- get_node_type()
Get the type of this node.
- Returns
The type of this ContentNode.
- Return type
str
>>> new_page.get_node_type()
"page"
- select_first(selector, variables=None)
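Select and return the first child of this node that matches the selector value.
- Parameters
selector (str) – The selector (ie. //*)
variables (dict, optional) – A dictionary of variable names and values to substitute into the selector; defaults to None.
- Returns
The first matching node or None
- Return type
Optional[ContentNode]
>>> document.get_root().select_first('.')
ContentNode
>>> document.get_root().select_first('//*[hasTag($tagName)]', {"tagName": "div"})
ContentNode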
- select(selector, variables=None)
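Select and return the child nodes of this node that match the selector value.
- Parameters
selector (str) – The selector (ie. //*)
variables (dict, optional) – A dictionary of variable names and values to substitute into the selector; defaults to None.
- Returns
A list of the matching content nodes. If no matches are found, the list will be empty.
- Return type
list[ContentNode]
>>> document.get_root().select('.')
[ContentNode]
>>> document.get_root().select('//*[hasTag($tagName)]', {"tagName": "div"})
[ContentNode]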
- get_all_content(separator=' ', strip=True)
Get this node’s content, concatenated with all of its children’s content.
- Parameters
separator (str, optional) – The separator to use in joining content together; defaults to " ".
strip (boolean, optional) – Strip the result
- Returns
The complete content for this node concatenated with the content of all child nodes.
- Return type
str
>>> document.content_node.get_all_content()
"This string is made up of multiple nodes"
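The concatenation described above can be pictured as a depth-first walk. The sketch below is a stand-in under assumptions, not the Kodexa implementation: the `Node` dataclass and the `all_content` function are hypothetical, and skipping empty content is a simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a content node."""
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def all_content(node: Node, separator: str = " ", strip: bool = True) -> str:
    # Start with this node's own content, if it has any.
    parts = [node.content] if node.content else []
    # Append the full content of each child, depth first.
    for child in node.children:
        child_text = all_content(child, separator, strip)
        if child_text:
            parts.append(child_text)
    text = separator.join(parts)
    return text.strip() if strip else text


root = Node(children=[
    Node("This string"),
    Node("is made up of", [Node("multiple nodes")]),
])
joined = all_content(root)
```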
- adopt_children(nodes_to_adopt, replace=False)
This will take a list of content nodes and adopt them under this node, ensuring they are re-parented.
- Parameters
nodes_to_adopt (List[ContentNode]) – A list of ContentNodes that will be added to the end of this node’s children collection
replace (bool) – If True, will remove all current children and replace them with the new list; defaults to False
>>> # select all nodes of type 'line', then the root node 'adopts' them
>>> # and replaces all its existing children with these 'line' nodes.
>>> document.get_root().adopt_children(document.select('//line'), replace=True)
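Re-parenting means each adopted node leaves its old parent before joining the new one. The sketch below illustrates that idea with a hypothetical `Node` class; it is not Kodexa code, and the way replaced children are detached is an assumption.

```python
from typing import List, Optional


class Node:
    """Hypothetical stand-in for a content node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = []

    def add_child(self, child: "Node") -> None:
        # Detach from the old parent first so the child has one parent only.
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self
        self.children.append(child)

    def adopt_children(self, nodes_to_adopt: List["Node"], replace: bool = False) -> None:
        if replace:
            # Drop current children that are not part of the adopted list.
            for child in list(self.children):
                if child not in nodes_to_adopt:
                    child.parent = None
                    self.children.remove(child)
        for node in list(nodes_to_adopt):
            self.add_child(node)


root, page = Node("root"), Node("page")
line1, line2 = Node("line1"), Node("line2")
root.add_child(page)
page.add_child(line1)
page.add_child(line2)

# The root takes over the lines and drops the page.
root.adopt_children([line1, line2], replace=True)
adopted_names = [child.name for child in root.children]
```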
- remove_tag(tag_name)
Remove a tag from this content node.
- Parameters
tag_name (str) – The name of the tag that should be removed.
>>> document.get_root().remove_tag('foo')
- set_statistics(statistics)
Set the spatial statistics for this node
- Parameters
statistics – the statistics object
>>> document.select('//page')[0].set_statistics(NodeStatistics())
- get_statistics()
Get the spatial statistics for this node
- Returns
the statistics object (or None if not set)
>>> document.select('//page')[0].get_statistics()
<kodexa.spatial.NodeStatistics object at 0x7f80605e53c8>
- set_bbox(bbox)
Set the bounding box for the node, this is structured as:
[x1,y1,x2,y2]
- Parameters
bbox – the bounding box array
>>> document.select('//page')[0].set_bbox([10,20,50,100])
- get_bbox()
Get the bounding box for the node, this is structured as:
[x1,y1,x2,y2]
- Returns
the bounding box array
>>> document.select('//page')[0].get_bbox()
[10,20,50,100]
- set_bbox_from_children()
Set the bounding box for this node based on its children
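One plausible reading of deriving a bounding box from children is the smallest rectangle that encloses every child box. The sketch below shows that calculation for the [x1,y1,x2,y2] layout; the `union_bbox` helper is hypothetical, and the enclosing-rectangle rule is an assumption rather than confirmed Kodexa behaviour.

```python
from typing import List, Sequence


def union_bbox(child_bboxes: Sequence[Sequence[float]]) -> List[float]:
    """Smallest [x1, y1, x2, y2] box that encloses every child box."""
    return [
        min(box[0] for box in child_bboxes),  # left-most x1
        min(box[1] for box in child_bboxes),  # lowest y1
        max(box[2] for box in child_bboxes),  # right-most x2
        max(box[3] for box in child_bboxes),  # highest y2
    ]


parent_bbox = union_bbox([[10, 20, 50, 100], [30, 10, 80, 60]])
```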
- set_rotate(rotate)
Set the rotate of the node
- Parameters
rotate – the rotation of the node
>>> document.select('//page')[0].set_rotate(90)
- get_rotate()
Get the rotate of the node
- Returns
the rotation of the node
>>> document.select('//page')[0].get_rotate()
90
- get_x()
Get the X position of the node
- Returns
the X position of the node
>>> document.select('//page')[0].get_x()
10
- get_y()
Get the Y position of the node
- Returns
the Y position of the node
>>> document.select('//page')[0].get_y()
90
- get_width()
Get the width of the node
- Returns
the width of the node
>>> document.select('//page')[0].get_width()
70
- get_height()
Get the height of the node
- Returns
the height of the node
>>> document.select('//page')[0].get_height()
40
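As a rough illustration of how the position and size getters could relate to the [x1,y1,x2,y2] bounding box, the sketch below assumes x and y come from the first corner and that width and height are the differences between the corners. Both that mapping and the `bbox_geometry` helper are assumptions, not Kodexa code, and the sample outputs in the entries above do not all come from one box.

```python
from typing import Dict, Sequence


def bbox_geometry(bbox: Sequence[float]) -> Dict[str, float]:
    """Derive position and size from an [x1, y1, x2, y2] box (hypothetical helper)."""
    x1, y1, x2, y2 = bbox
    return {"x": x1, "y": y1, "width": x2 - x1, "height": y2 - y1}


geometry = bbox_geometry([10, 20, 50, 100])
```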
- copy_tag(selector='.', existing_tag_name=None, new_tag_name=None)
Creates a new tag of ‘new_tag_name’ on the selected content node(s) with the same information as the tag with ‘existing_tag_name’. Both existing_tag_name and new_tag_name values are required and must be different from one another. Otherwise, no action is taken. If a tag with the ‘existing_tag_name’ does not exist on a selected node, no action is taken for that node.
- Parameters
selector – The selector to identify the source nodes to work on (default . - the current node)
existing_tag_name (str) – The name of the existing tag whose values will be copied to the new tag. (Default value = None)
new_tag_name (str) – The name of the new tag. This must be different from the existing_tag_name. (Default value = None)
>>> document.get_root().copy_tag(existing_tag_name='foo', new_tag_name='bar')
- collect_nodes_to(end_node)
Get the sibling nodes between the current node and the end_node.
- Parameters
end_node (ContentNode) – The node to end at
- Returns
A list of sibling nodes between this node and the end_node.
- Return type
list[ContentNode]
>>> document.content_node.get_children()[0].collect_nodes_to(end_node=document.content_node.get_children()[5])
- tag_nodes_to(end_node, tag_to_apply, tag_uuid: str = None)
Tag all the nodes from this node to the end_node with the given tag name
- Parameters
end_node (ContentNode) – The node to end with
tag_to_apply (str) – The tag name that will be applied to each node
tag_uuid (str) – The tag uuid used if you want to group them
>>> document.content_node.get_children()[0].tag_nodes_to(document.content_node.get_children()[5], tag_to_apply='foo')
- tag_range(start_content_re, end_content_re, tag_to_apply, node_type_re='.*', use_all_content=False)
This will tag all the child nodes between the start and end content regular expressions
- Parameters
start_content_re – The regular expression to match the starting child
end_content_re – The regular expression to match the ending child
tag_to_apply – The tag name that will be applied to the nodes in range
node_type_re – The node type to match (default is all)
use_all_content – Use full content (including child nodes, default is False)
>>> document.content_node.tag_range(start_content_re='.*Cheese.*', end_content_re='.*Fish.*', tag_to_apply='foo')
- tag(tag_to_apply, selector='.', content_re=None, use_all_content=False, node_only=None, fixed_position=None, data=None, separator=' ', tag_uuid: str = None, confidence=None, value=None, use_match=True, index=None, cell_index=None, group_uuid=None, parent_group_uuid=None, note=None, status=None)
This will tag (see Feature Tagging) the expression groups identified by the regular expression.
Note that if you use the flag use_all_content then node_only will default to True if not set, else it will default to False
- Parameters
tag_to_apply – The name of tag that will be applied to the node
selector – The selector to identify the source nodes to work on (default . - the current node)
content_re – The regular expression that you wish to use to tag, note that we will create a tag for each matching group (Default value = None)
use_all_content – Apply the regular expression to the all_content (include content from child nodes) (Default value = False)
separator – Separator to use for use_all_content (Default value = " ")
node_only – Ignore the matching groups and tag the whole node (Default value = None)
fixed_position – Use a fixed position, supplied as a tuple i.e. - (4,10) tag from position 4 to 10 (default None)
data – A dictionary of data for the given tag (Default value = None)
tag_uuid (str) – A UUID used to tie tags together in order to show that they are related and form a single concept. For example, if tagging the two words “Wells” and “Fargo” as an ORGANIZATION, the tag on both words should have the same tag_uuid in order to indicate they are both needed to form the single ORGANIZATION. If a tag_uuid is provided, it is used on all tags created in this method. This may result in multiple nodes or multiple feature values having the same tag_uuid. For example, if the selector provided results in more than one node being selected, each node would be tagged with the same tag_uuid. The same holds true if a content_re value is provided, node_only is set to False, and multiple matches are found for the content_re pattern. In that case, each feature value would share the same UUID. If no tag_uuid is provided, a new uuid is generated for each tag instance. (Default value = None)
confidence – The confidence in the tag (0-1)
value – The value you wish to store with the tag, this allows you to provide text that isn’t part of the content but represents the data you wish tagged
use_match – If True (default) we will use match for regex matching, if False we will use search
index – The index for the tag
cell_index – The cell index for the tag
group_uuid – The group uuid for the tag
parent_group_uuid – The parent group uuid for the tag
note – a text note for the tag
status – a status for the tag; this can be transitioned to an attribute status during extraction
>>> document.content_node.tag('is_cheese')
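The use_match flag and the "one tag per matching group" rule can be illustrated with the standard `re` module alone. The sketch below is not Kodexa code: `tag_positions` is a hypothetical helper that returns the (start, end, value) triple for each group, assuming the flag simply switches between `re.match` and `re.search`.

```python
import re
from typing import List, Tuple


def tag_positions(content: str, content_re: str, use_match: bool = True) -> List[Tuple[int, int, str]]:
    """Return (start, end, value) for each matched group (hypothetical helper)."""
    pattern = re.compile(content_re)
    # re.match anchors at the start of the content; re.search scans for the first hit.
    found = pattern.match(content) if use_match else pattern.search(content)
    if found is None:
        return []
    positions = []
    for group_index in range(1, pattern.groups + 1):
        # Optional groups that did not take part in the match are skipped.
        if found.group(group_index) is not None:
            positions.append((found.start(group_index),
                              found.end(group_index),
                              found.group(group_index)))
    return positions


content = "The Cheese Moved"
with_match = tag_positions(content, r"(Cheese)", use_match=True)
with_search = tag_positions(content, r"(Cheese)", use_match=False)
anchored = tag_positions(content, r".*(Cheese).*", use_match=True)
```

With matching, a pattern has to account for the start of the content, which is why `(Cheese)` alone finds nothing while `.*(Cheese).*` does.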
- get_tags()
Returns a list of the names of the tags on the given node
- Returns
A list of the tag names
>>> document.content_node.select_first('*').get_tags()
['is_cheese']
- get_tag_features()
Returns a list of the features that are tags on the given node
- Returns
A list of the tag features
>>> document.content_node.select_first('*').get_tag_features()
[ContentFeature()]
- get_tag_values(tag_name, include_children=False)
Get the values for a specific tag name
- Parameters
tag_name – tag name
include_children – include the children of this node (Default value = False)
- Returns
a list of the tag values
Get the values for a specific tag name, grouped by uuid
Get the nodes for a specific tag name, grouped by uuid
- get_tag(tag_name, tag_uuid=None)
Returns the value of a tag (a dictionary). This can be either a single value in a list, [[start,end,value]], or, if multiple parts of the content of this node match, a list of lists, i.e. [[start1,end1,value1],[start2,end2,value2]].
- Parameters
tag_name – The name of the tag
tag_uuid (Optional) – Optionally you can also provide the tag UUID
- Returns
A list of the tagged locations and values for this label in this node
>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').get_tag('is_cheese')
[0,10,'The Cheese Moved']
- get_all_tags()
Get the names of all tags that have been applied to this node or to its children.
>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').get_all_tags()
['is_cheese']
- has_tags()
Determines if this node has any tags at all.
- Returns
True if node has any tags; else, False.
- Return type
bool
>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').has_tags()
True
- has_tag(tag, include_children=False)
Determine if this node has a tag with the specified name.
- Parameters
tag (str) – The name of the tag.
include_children (bool) – Include the children of this node (Default value = False)
- Returns
True if node has a tag by the specified name; else, False.
- Return type
bool
>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').has_tag('is_cheese')
True
>>> document.content_node.select_first('//*[contentRegex(".*Cheese.*")]').has_tag('is_fish')
False
- is_first_child()
Determines if this node is the first child of its parent or has no parent.
- Returns
True if this node is the first child of its parent or if this node has no parent; else, False.
- Return type
bool
- is_last_child()
Determines if this node is the last child of its parent or has no parent.
- Returns
True if this node is the last child of its parent or if this node has no parent; else, False.
- Return type
bool
- get_last_child_index()
Returns the max index value for the children of this node. If the node has no children, returns None.
- Returns
The max index of the children of this node, or None if there are no children.
- Return type
int or None
- get_node_at_index(index)
Returns the child node at the specified index. If the specified index is outside the range from the first index (0) to the last child’s index, None is returned.
Note: documents allow for sparse representation and child nodes may not have consecutive index numbers. If there isn’t a child node at the specified index, a ‘virtual’ node will be returned. This ‘virtual’ node will have the node type of its nearest sibling and will have an index value, but will have no features or content.
- Parameters
index (int) – The index (zero-based) for the child node.
- Returns
Node at index, or None if the index is outside the boundaries of child nodes.
- Return type
ContentNode or None
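The sparse lookup described above can be sketched with a small stand-in. This is an illustration, not the Kodexa implementation: the `Child` dataclass and `node_at_index` function are hypothetical, and picking the nearest sibling by index distance is an assumption about what "nearest" means.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Child:
    """Hypothetical stand-in for a child node."""
    node_type: str
    index: int
    content: Optional[str] = None
    virtual: bool = False


def node_at_index(children: List[Child], index: int) -> Optional[Child]:
    if not children:
        return None
    last_index = max(child.index for child in children)
    # Outside 0..last_index there is nothing to return.
    if index < 0 or index > last_index:
        return None
    for child in children:
        if child.index == index:
            return child
    # Gap in a sparse document: synthesize a virtual node typed like the nearest sibling.
    nearest = min(children, key=lambda child: abs(child.index - index))
    return Child(node_type=nearest.node_type, index=index, virtual=True)


children = [Child("line", 0, "first"), Child("line", 3, "fourth")]
real = node_at_index(children, 3)
gap = node_at_index(children, 1)
outside = node_at_index(children, 7)
```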
- has_next_node(node_type_re='.*', skip_virtual=False)
Determine if this node has a next sibling that matches the type specified by the node_type_re regex.
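- Parameters
node_type_re (str, optional) – The regular expression to match against the next sibling node’s type; default is ‘.*’.
skip_virtual (bool, optional) – Skip virtual nodes; default is False.
- Returns
True if there is a next sibling node matching the specified type regex; else, False.
- Return type
bool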
- has_previous_node(node_type_re='.*', skip_virtual=False)
Determine if this node has a previous sibling that matches the type specified by the node_type_re regex.
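- Parameters
node_type_re (str, optional) – The regular expression to match against the previous sibling node’s type; default is ‘.*’.
skip_virtual (bool, optional) – Skip virtual nodes; default is False.
- Returns
True if there is a previous sibling node matching the specified type regex; else, False.
- Return type
bool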
- next_node(node_type_re='.*', skip_virtual=False, has_no_content=True)
Returns the next sibling content node.
Note: This logic relies on node indexes. Documents allow for sparse representation and child nodes may not have consecutive index numbers. Therefore, the next node might actually be a virtual node that is created to fill a gap in the document. You can skip virtual nodes by setting the skip_virtual parameter to True.
- Parameters
node_type_re (str, optional) – The regular expression to match against the next sibling node’s type; default is ‘.*’.
skip_virtual (bool, optional) – Skip virtual nodes and return the next real node; default is False.
has_no_content (bool, optional) – Allow a node that has no content to be returned; default is True.
- Returns
The next node or None, if no node exists
- Return type
ContentNode or None
- previous_node(node_type_re='.*', skip_virtual=False, has_no_content=False, traverse=Traverse.SIBLING)
Returns the previous sibling content node.
Note: This logic relies on node indexes. Documents allow for sparse representation and child nodes may not have consecutive index numbers. Therefore, the previous node might actually be a virtual node that is created to fill a gap in the document. You can skip virtual nodes by setting the skip_virtual parameter to True.
- Parameters
node_type_re (str, optional) – The regular expression to match against the previous node’s type; default is ‘.*’.
skip_virtual (bool, optional) – Skip virtual nodes and return the previous real node; default is False.
has_no_content (bool, optional) – Allow a node that has no content to be returned; default is False.
traverse (Traverse(enum), optional) – The transition you’d like to traverse (SIBLING, CHILDREN, PARENT, or ALL); default is Traverse.SIBLING.
- Returns
The previous node or None, if no node exists
- Return type
ContentNode or None
- class kodexa.model.Document(metadata=None, content_node: ContentNode = None, source=None, ref: str = None, kddb_path: str = None, delete_on_close=False)
Bases: object
A Document is a collection of metadata and a set of content nodes.
- PREVIOUS_VERSION :str = 1.0.0
- CURRENT_VERSION :str = 4.0.2
- metadata :DocumentMetadata
Metadata relating to the document
- _content_node :Optional[ContentNode]
The root content node
- virtual :bool = False
Is the document virtual (deprecated)
- _mixins :List[str] = []
A list of the mixins for this document
- uuid :str
The UUID of this document
- exceptions :List = []
A list of the exceptions on this document (deprecated)
- log :List[str] = []
A log for this document (deprecated)
- version
The version of the document
- source :SourceMetadata
Source metadata for this document
- labels :List[str] = []
A list of the document level labels for the document
- taxonomies :List[str] = []
A list of the taxonomy references for this document
- classes :List[ContentClassification] = []
A list of the content classifications associated at the document level
- __str__()
Return str(self).
- add_exception(exception: ContentException)
- get_persistence()
- get_all_tags()
- get_tagged_nodes(tag_name, tag_uuid=None)
- property content_node
The root content Node
- add_classification(label: str, taxonomy_ref: Optional[str] = None) ContentClassification
Add a content classification to the document
- add_label(label: str)
Add a label to the document
- Parameters
label (str) – Label to add
- Returns
the document
- remove_label(label: str)
Remove a label from the document
- Parameters
label (str) – Label to remove
- Returns
the document
- classmethod from_text(text, separator=None)
Creates a new Document from the text provided.
- Parameters
text (str) – Text to be used as content on the Document’s ContentNode(s)
separator (str) – If provided, this string will be used to split the text and the resulting text will be placed on children of the root ContentNode. (Default value = None)
- Returns
the document
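The effect of the separator can be pictured with plain dictionaries. The sketch below is an assumption-laden illustration, not Kodexa code: `from_text_layout` is a hypothetical helper, and placing unsplit text directly on the root is a guess about what happens when no separator is given.

```python
from typing import Any, Dict, Optional


def from_text_layout(text: str, separator: Optional[str] = None) -> Dict[str, Any]:
    """Describe where the text would end up (hypothetical helper)."""
    root: Dict[str, Any] = {"content": None, "children": []}
    if separator:
        # Each piece of the split text becomes the content of one child.
        root["children"] = [{"content": part} for part in text.split(separator)]
    else:
        # Assumption: without a separator the whole text sits on the root.
        root["content"] = text
    return root


split_layout = from_text_layout("one,two,three", separator=",")
plain_layout = from_text_layout("one,two,three")
```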
- get_root()
Get the root content node for the document (same as content_node)
- to_kdxa(file_path: str)
Write the document to the kdxa format (msgpack) which can be used with the Kodexa platform
- Parameters
file_path (str) – the path to the kdxa file you wish to create
>>> document.to_kdxa('my-document.kdxa')
- static open_kddb(file_path)
Opens a Kodexa Document Database.
This is the Kodexa V4 default way to store documents; it provides high performance and the ability to handle very large document objects.
- Parameters
file_path – The file path
- Returns
The Document instance
- close()
Close the document and clean up the resources
- to_kddb(path=None)
Either write this document to a KDDB file or convert this document object structure into a KDDB and return a bytes-like object
This is dependent on whether you provide a path to write to
- static from_kdxa(file_path)
Read a .kdxa file from the given file_path and return the Document
- Parameters
file_path – the path to the kdxa file
>>> document = Document.from_kdxa('my-document.kdxa')
- to_msgpack()
Convert this document object structure into a message pack
- to_json()
Create a JSON string representation of this Document.
- Returns
The JSON formatted string representation of this Document.
- Return type
str
>>> document.to_json()
- to_dict()
Create a dictionary representing this Document’s structure and content.
- Returns
A dictionary representation of this Document.
- Return type
dict
>>> document.to_dict()
- static from_dict(doc_dict)
Build a new Document from a dictionary.
- Parameters
doc_dict (dict) – A dictionary representation of a Kodexa Document.
- Returns
A complete Kodexa Document
- Return type
Document
>>> Document.from_dict(doc_dict)
- static from_json(json_string)
Create an instance of a Document from a JSON string.
- Parameters
json_string (str) – A JSON string representation of a Kodexa Document
- Returns
A complete Kodexa Document
- Return type
Document
>>> Document.from_json(json_string)
- static from_msgpack(msgpack_bytes)
Create an instance of a Document from a message pack byte array.
- Parameters
msgpack_bytes (bytes) – A message pack byte array.
- Returns
A complete Kodexa Document
- Return type
Document
>>> Document.from_msgpack(open(os.path.join('news-doc.kdxa'), 'rb').read())
- get_mixins()
Get the list of mixins that have been enabled on this document
- Returns
a list of the mixin names
- Return type
list[str]
- add_mixin(mixin)
Add the given mixin to this document. This will apply the mixin to all the content nodes and also register it with the document, so that future invocations of create_node will ensure the node has the mixin applied.
- Parameters
mixin (str) – the name of the mixin to add
>>> from kodexa import *
>>> document = Document()
>>> document.add_mixin('spatial')
- create_node(node_type: str, content: Optional[str] = None, virtual: bool = False, parent: ContentNode = None, index: Optional[int] = None)
Creates a new node for the document. The new node is not added to the document, but any mixins that have been applied to the document will also be available on the new node.
- Parameters
node_type (str) – The type of node.
content (str) – The content for the node; defaults to None.
virtual (bool) – Indicates if this is a ‘real’ or ‘virtual’ node; default is False. ‘Real’ nodes contain document content. ‘Virtual’ nodes are synthesized as necessary to fill gaps in between non-consecutively indexed siblings. Such indexing arises when document content is sparse.
parent (ContentNode) – The parent for this newly created node; default is None.
index (Optional[int]) – The index property to be set on this node; default is None.
- Returns
This newly created node.
- Return type
ContentNode
>>> document.create_node(node_type='page')
<kodexa.model.model.ContentNode object at 0x7f80605e53c8>
- classmethod from_kddb(source, detached: bool = False)
Loads a document from a Kodexa Document Database (KDDB) file
- Parameters
source – if a string, the file at that path is loaded; if bytes, a temporary file is created and the KDDB is loaded into it
detached (bool) – if reading from a file we will create a copy so we don’t update in place
- Returns
the document
- classmethod from_file(file, unpack: bool = False)
Creates a Document that has a ‘file-handle’ connector to the specified file.
- Parameters
file – The file to which the new Document is connected.
unpack (bool) – (Default value = False)
- Returns
A Document connected to the specified file.
- Return type
Document
- classmethod from_url(url, headers=None)
Creates a Document that has a ‘url’ connector for the specified url.
- Parameters
url (str) – The URL to which the new Document is connected.
headers (dict) – Headers that should be used when reading from the URL (Default value = None)
- Returns
A Document connected to the specified URL with the specified headers (if any).
- Return type
Document
- select_first(selector, variables=None) Optional[ContentNode]
Select and return the first child of this node that matches the selector value.
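- Parameters
selector – The selector to evaluate.
variables – Optional dict of variables referenced by the selector (for example {"tagName": "div"}); default is None.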
- Returns
The first matching node or none
- Return type
Optional[ContentNode]
>>> document.get_root().select_first('.')
ContentNode
>>> document.get_root().select_first('//*[hasTag($tagName)]', {"tagName": "div"})
ContentNode
- select(selector: str, variables: Optional[dict] = None) List[ContentNode]
Execute a selector on the root node and then return a list of the matching nodes.
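- Parameters
selector (str) – The selector to evaluate.
variables (Optional[dict]) – Variables referenced by the selector; default is None.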
- Returns
A list of the matching ContentNodes. If no matches found, list is empty.
- Return type
List[ContentNode]
>>> document.select('.')
[ContentNode]
- class kodexa.model.DocumentMetadata(*args, **kwargs)
Bases: addict.Dict
A flexible dict based approach to capturing metadata for the document
- class kodexa.model.SourceMetadata
Class for keeping track of the original source information for a document
- original_filename :Optional[str]
- original_path :Optional[str]
- checksum :Optional[str]
- cid :Optional[str]
- last_modified :Optional[str]
- created :Optional[str]
- connector :Optional[str]
- mime_type :Optional[str]
- headers :Optional[addict.Dict]
- lineage_document_uuid :Optional[str]
- source_document_uuid :Optional[str]
- pdf_document_uuid :Optional[str]
- classmethod from_dict(env)
- Parameters
env –
- class kodexa.model.ContentException(exception_type: str, message: str, severity: str = 'ERROR', tag: Optional[str] = None, group_uuid: Optional[str] = None, tag_uuid: Optional[str] = None, exception_details: Optional[str] = None, node_uuid: Optional[str] = None, *args, **kwargs)
Bases: addict.Dict
A content exception represents an issue identified during labeling or validation at the document level
- class kodexa.model.ContentObject
Bases: kodexa.model.base.KodexaBaseModel
- id :Optional[str]
- uuid :Optional[str]
- created_on :Optional[datetime.datetime]
- updated_on :Optional[datetime.datetime]
- content_type :ContentType
- document_version :Optional[str]
- index :Optional[int]
- labels :Optional[List[Label]]
- metadata :Optional[Dict[str, Any]]
- mixins :Optional[List[str]]
- created :Optional[datetime.datetime]
- modified :Optional[datetime.datetime]
- size :Optional[int]
- store_ref :Optional[str]
- document_family_id :Optional[str]
- class kodexa.model.ContentType
Bases: enum.Enum
The type of content
- document = DOCUMENT
- native = NATIVE
- class kodexa.model.ModelContentMetadata
Bases: kodexa.model.base.KodexaBaseModel
- type :Optional[str]
- model_runtime_ref :Optional[str]
- state_hash :Optional[str]
- model_runtime_parameters :Optional[Dict[str, Any]]
- state :Optional[State]
- trainable :Optional[bool]
- use_implementation_from_template :Optional[bool]
- template_ref :Optional[str]
- vue_template :Optional[str]
- training_options :Optional[List[Option]]
- training_parameters :Optional[Dict[str, Any]]
- inference_options :Optional[List[Option]]
- inference_parameters :Optional[Dict[str, Any]]
- build_statistics :Optional[Dict[str, Any]]
- final_statistics :Optional[Dict[str, Any]]
- deployment :Optional[DeploymentMetadata]
- taxonomy :Optional[Taxonomy]
- additional_taxon_options :Optional[List[Option]]
- contents :Optional[List[str]]
- ignored_contents :Optional[List[str]]
- base_dir :Optional[str]
- class kodexa.model.DocumentContentMetadata
Bases: kodexa.model.base.KodexaBaseModel
- type :Optional[str]
- class kodexa.model.ContentEvent
Bases: kodexa.model.base.KodexaBaseModel
- type :Optional[str]
- content_object :Optional[ContentObject]
- document_family :Optional[DocumentFamily]
- object_event_type :Optional[ObjectEventType]
- class kodexa.model.DocumentActor
Bases: kodexa.model.base.KodexaBaseModel
Provides the definition of an actor in a transition
- actor_id :Optional[str]
- actor_type :Optional[ActorType]
- class kodexa.model.DocumentTransition
Bases: kodexa.model.base.KodexaBaseModel
Provides the definition of a transition for a document, where a change was applied by an assistant, user or external process
- id :Optional[str]
- uuid :Optional[str]
- created_on :Optional[datetime.datetime]
- updated_on :Optional[datetime.datetime]
- unknown_fields :Optional[Dict[str, str]]
- transition_type :Optional[TransitionType]
- index :Optional[int]
- date_time :Optional[datetime.datetime]
- actor :Optional[DocumentActor]
- label :Optional[str]
- destination_content_object_id :Optional[str]
- source_content_object_id :Optional[str]
- class kodexa.model.AssistantEvent
Bases: kodexa.model.base.KodexaBaseModel
- type :Optional[str]
- content_object :Optional[ContentObject]
- options :Optional[Dict[str, Any]]
- event_type :Optional[str]
- assistant :Optional[Assistant]
- class kodexa.model.ActorType
Bases: enum.Enum
The type of actor
- user = USER
- assistant = ASSISTANT
- access_token = ACCESS_TOKEN
- api = API
- class kodexa.model.StoreType
Bases: enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- document = DOCUMENT
- table = TABLE
- dictionary = DICTIONARY
- model = MODEL
- class kodexa.model.StorePurpose
Bases: enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- operational = OPERATIONAL
- training = TRAINING
- class kodexa.model.Taxonomy
Bases: ExtensionPackProvided
Provides the taxonomy hierarchy that is used for content and document classification and labeling
- type :Optional[str]
- taxonomy_type :Optional[TaxonomyType1]
- enabled :Optional[bool]
- taxons :Optional[List[Taxon]]
- overlays :Optional[List[Overlay]]
- total_taxons :Optional[int]
- class kodexa.model.ExtensionPack
Bases: kodexa.model.base.KodexaBaseModel
- org_slug :Optional[constr(regex='^[a-zA-Z0-9\\-_]{0,100}$')]
- slug :Optional[constr(regex='^[a-zA-Z0-9\\-_]{0,100}$')]
- name :Optional[str]
- description :Optional[str]
- public_access :Optional[bool]
- pack_uri :Optional[str]
- status :Optional[Status]
- deployable :Optional[bool]
- services :Optional[List[SlugBasedMetadata]]
- source :Optional[ExtensionPackSource]
- deployment :Optional[DeploymentMetadata]
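The org_slug and slug fields are constrained by a regular expression. As a rough illustration of what that constraint accepts, the sketch below applies the same pattern with the standard `re` module; `is_valid_slug` is a hypothetical helper, not part of Kodexa.

```python
import re

# Pattern copied from the org_slug and slug constraints listed above.
SLUG_PATTERN = re.compile(r"^[a-zA-Z0-9\-_]{0,100}$")


def is_valid_slug(value: str) -> bool:
    """Return True if value uses only letters, digits, '-' or '_' (max 100 chars)."""
    return SLUG_PATTERN.match(value) is not None


print(is_valid_slug("my-pack_01"))  # True
print(is_valid_slug("my pack"))     # False: spaces are not allowed
print(is_valid_slug(""))            # True: the pattern permits zero characters
```

Note that the pattern itself allows an empty string, so a non-empty check would have to be enforced elsewhere if it is required.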
- class kodexa.model.AssistantDefinition
Bases: ExtensionPackProvided
- template :Optional[bool]
- schedulable :Optional[bool]
- reactive :Optional[bool]
- assistant :Optional[AssistantImplementation]
- metadata :Optional[AssistantMetadata]
- services :Optional[List[SlugBasedMetadata]]
- processing_taxonomies :Optional[List[AssistantTaxonomy]]
- options :Optional[List[Option]]
- additional_taxon_options :Optional[List[Option]]
- event_types :Optional[List[CustomEvent]]
- default_schedules :Optional[List[ScheduleDefinition]]
- subscription :Optional[str]
- class kodexa.model.SqliteDocumentPersistence(document: kodexa.model.Document, filename: str = None, delete_on_close=False)
Bases: object
The Sqlite persistence engine to support large scale documents (part of the V4 Kodexa Document Architecture)
- get_all_tags()
select * from cn where id in (select cn_id from ft where f_type in (select id from f_type where name like 'tag:%'))
- update_features(node)
- update_node(node)
- get_content_nodes(node_type, parent_node: kodexa.model.ContentNode, include_children)
- initialize()
- close()
- get_max_feature_id()
- __build_db()
- content_node_count()
- get_feature_type_id(feature)
- __resolve_f_type(feature)
- __resolve_n_type(n_type)
- __insert_node(node: kodexa.model.ContentNode, parent, execute=True)
- __clean_none_values(d)
- __update_metadata()
- __load_document()
- get_content_parts(new_node)
- __build_node(node_row)
- add_content_node(node, parent, execute=True)
- remove_feature(node, feature_type, name)
- get_children(content_node)
- get_child_ids(content_node)
- get_node(node_id)
- get_parent(content_node)
- update_metadata()
- __rebuild_from_document()
- sync()
- get_bytes()
- get_features(node)
- update_content_parts(node, content_parts)
- remove_content_node(node)
- remove_all_features(node)
- remove_all_features_by_id(node_id)
- get_next_node_id()
- get_tagged_nodes(tag, tag_uuid=None)
- add_exception(exception: kodexa.model.model.ContentException)
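The query quoted under get_all_tags() can be tried against an in-memory SQLite database. The sketch below is an assumption-heavy illustration: the table names (cn, ft, f_type) and the columns the query touches come from that query, while the remaining columns, the sample rows, and the idea that tag feature types are named like 'tag:heading' are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical minimal schema: only the columns the query needs, plus content.
    CREATE TABLE cn (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE f_type (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ft (id INTEGER PRIMARY KEY, cn_id INTEGER, f_type INTEGER);

    INSERT INTO cn VALUES (1, 'Invoice'), (2, 'Total'), (3, 'Footer');
    INSERT INTO f_type VALUES (1, 'tag:heading'), (2, 'spatial:bbox');
    INSERT INTO ft VALUES (1, 1, 1), (2, 2, 2), (3, 3, 2);
""")

# Same shape as the documented query: nodes that carry at least one 'tag:' feature.
TAGGED_NODES_SQL = """
    SELECT * FROM cn WHERE id IN (
        SELECT cn_id FROM ft WHERE f_type IN (
            SELECT id FROM f_type WHERE name LIKE 'tag:%'))
"""
tagged = conn.execute(TAGGED_NODES_SQL).fetchall()
print(tagged)
# [(1, 'Invoice')]
conn.close()
```

Only node 1 is returned, because it is the only node with a feature whose type name starts with 'tag:'.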
- class kodexa.model.PersistenceManager(document: kodexa.model.Document, filename: str = None, delete_on_close=False)
Bases: object
The persistence manager supports holding the document and only flushing objects to the persistence layer as needed.
This is implemented to allow us to work with large, complex documents in a performance-centered way.
- add_exception(exception: kodexa.model.model.ContentException)
- get_all_tags()
- get_tagged_nodes(tag, tag_uuid=None)
- initialize()
- get_parent(node)
- close()
- flush_cache()
- get_content_nodes(node_type, parent_node, include_children)
- get_bytes()
- update_metadata()
- add_content_node(node, parent)
- get_node(node_id)
- remove_content_node(node)
- get_children(node)
- update_node(node)
- update_content_parts(node, content_parts)
- get_content_parts(node)
- remove_feature(node, feature_type, name)
- get_features(node)
- add_feature(node, feature)
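The idea of holding objects in memory and flushing them to the persistence layer only when needed can be sketched as a small write-back cache. This is not the PersistenceManager implementation: `WriteBackCache` and its `get` and `update` methods are hypothetical, a plain dict stands in for the persistence layer, and only the method name flush_cache is borrowed from the listing above.

```python
from typing import Dict, Set


class WriteBackCache:
    """Hypothetical sketch: keep objects in memory and flush only changed ones."""

    def __init__(self, store: Dict[int, str]):
        self._store = store              # stands in for the persistence layer
        self._cache: Dict[int, str] = {}
        self._dirty: Set[int] = set()
        self.writes = 0                  # number of writes to the backing store

    def get(self, node_id: int) -> str:
        # Read through to the store only on a cache miss.
        if node_id not in self._cache:
            self._cache[node_id] = self._store[node_id]
        return self._cache[node_id]

    def update(self, node_id: int, value: str) -> None:
        # Changes stay in memory until flush_cache() is called.
        self._cache[node_id] = value
        self._dirty.add(node_id)

    def flush_cache(self) -> None:
        # Write each changed object once, however many times it was updated.
        for node_id in sorted(self._dirty):
            self._store[node_id] = self._cache[node_id]
            self.writes += 1
        self._dirty.clear()


store = {1: "a", 2: "b"}
cache = WriteBackCache(store)
cache.update(1, "x")
cache.update(1, "y")   # second change to the same object
cache.get(2)           # a read does not mark anything as changed
cache.flush_cache()
print(store, cache.writes)
# {1: 'y', 2: 'b'} 1
```

Two updates to the same object result in a single write, which is the kind of saving that matters for large documents.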