Features Walkthrough
To understand all features, please first familiarize yourself with the Arrows and Nodes Model the framework is based on.
All examples can be run by saving the specification to a .yf file in a directory and executing Scraper in that directory (e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest).
A taskflow can be written in either JSON (file ending .jf or .json) or YML (file ending .yf, .yaml, or .yml).
These are the two core parsers provided with Scraper, both with full functionality.
For more types of parsers or how to implement your own format, see the section Scraper Formats.
An example specification in YML can be defined as follows:
name: helloworld
# entry: start
# globalNodeConfigurations: {}
# imports: {}
graphs:
  start:
    - type: Echo
      log: "Hello world!"
    - type: Echo
      log: "I accept the output of the previous node!"
    - type: Echo
      log: "I'm the last node in the graph!"
  other:
    - type: Echo
      log: "I'm not reachable and not used :("

Mandatory:
- name: Identifier for the taskflow specification
- graphs: A map where each key is the identifier of a single graph. Values are lists of nodes.
Optional:
- entry: Which graph is going to be used for the first entry task on startup. Default value is start.
- globalNodeConfigurations: Used to configure multiple nodes at once. Local node configuration has precedence over global node configuration. Default is no global node configurations. See the Global Node Configuration section.
- imports: Used to import other taskflows into this taskflow. See the Importing Other Taskflows section.
Each node in a graph can be configured depending on its implementation. A node is a key-value map in the specification. The documentation for each configuration of the core and some additional development Scraper nodes can be found here. Documentation for your own nodes can be generated; for more information, see the Generating Node Documentation section of the developer guide. A template for quickly developing and packaging your own custom nodes can be found in the Developing Custom Nodes section.
Example: Echo creating a static output at key obj of type Map<String, String> (Echo Documentation):
- type: Echo
  log: "Creating object"
  put: obj
  value:
    id: "10"
    name: "smith"
    age: "20"

The types of the node configuration have to match the types defined in the node implementation, which can be checked with the generated documentation:
- put: Where to put the object generated by value
- value: The object to put at the put location, of type A
Failing to match the types will throw an exception on starting the taskflow.
If a node is extending another node implementation, then it inherits the configuration of its parent. Generally, every node has a basic node configuration provided by Node.
Every node by default forwards to the next node using a dependent arrow.
The next target can be configured via the configuration keys forward and goTo (Node Documentation) where next is defined as follows:
- If forward is false, then do not forward
- If goTo is not set, then forward to the following node in the graph (list) this node is in
  - If this node is the last node in the graph, then do not forward
- If goTo is set, then resolve the address and forward to that target address (see Address Schema)
If the node implementation does not create other dependent arrows (i.e. modifying control flow), then not forwarding is the same as the flow terminating.
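The forwarding rules above can be sketched as a small Python function. This is a model of the rules as described here, not Scraper's own code; the function name `next_target` and the tuple return values are hypothetical, and nodes are represented as plain dictionaries.

```python
def next_target(graph, index):
    """Return where the node at graph[index] forwards to, or None.

    graph is a list of node configurations (dicts), as in a taskflow graph.
    The result is ("address", goTo) when an address has to be resolved,
    ("index", i) for the following node in the same graph, or None when
    the node does not forward.
    """
    node = graph[index]
    # Rule 1: forward set to false disables forwarding entirely.
    if node.get("forward", True) is False:
        return None
    # Rule 3: an explicit goTo is resolved via the address schema.
    if "goTo" in node:
        return ("address", node["goTo"])
    # Rule 2: otherwise continue with the following node, if there is one.
    if index + 1 < len(graph):
        return ("index", index + 1)
    return None


# A three-node graph: the first node jumps away, the last one ends the flow.
start = [
    {"type": "Echo", "goTo": "other"},
    {"type": "Echo", "label": "anecho"},
    {"type": "Echo"},
]

print(next_target(start, 0))  # ('address', 'other')
print(next_target(start, 1))  # ('index', 2)
print(next_target(start, 2))  # None
```

Checking forward first means that a node with both forward: false and goTo set does not forward, which matches the order in which the rules are listed.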
name: goto
graphs:
  start:
    - type: Echo
      goTo: other
    - type: Echo
      log: "I'm a labelled node!"
      label: anecho
    - type: Echo
      log: "I'm the last node in the graph!"
  other:
    - type: Echo
      log: "A short detour to another graph"
      goTo: start.anecho
    - type: Echo
      log: "I'm not reachable :("

It is usually enough, and recommended, to address only up to graph labels and not to address single nodes directly.
A flow map is a key-value map and travels along arrows.
For a dispatched arrow the flow map is copied for the newly created flow.
If multiple dependent arrows originate from the same node, then the node implementation decides which order the flow map is forwarded in.
This is usually marked in the generated control flow graph.
Flow maps carry data between nodes and, via templates, make node configuration depend on the flow map that is currently passing through the node.
name: echotest
graphs:
  start:
    - type: Pipe
      pipeTargets: [ok, ok2]
    # continue with the pipe result
    - type: Echo
      log: "My flow map contains both ok and ok2!"
  ok:
    - type: Echo
      put: ok # fills key ok in the accepting flow map
      value: hello
  ok2:
    - type: Echo
      put: ok2 # fills key ok2 in the accepting flow map
      value: hello
This example has no observable behavior. To inspect and use the contents of the flow map, templates can be used.
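To make the data flow of the echotest example visible, here is a toy Python model of it. It rests on an assumption drawn from the comments in the example: Pipe passes the flow map through its targets in order and continues with the result. The helpers `run_echo` and `run_graph` are hypothetical, and the model ignores templates, forward, goTo, and logging.

```python
def run_echo(node, flow):
    """Model of Echo: store the configured value at the key named by put."""
    if "put" in node:
        flow[node["put"]] = node.get("value")
    return flow


def run_graph(graphs, name, flow):
    """Run every node of graph `name` in list order over the flow map."""
    for node in graphs[name]:
        if node["type"] == "Pipe":
            # Assumption: each pipe target receives the output of the previous one.
            for target in node["pipeTargets"]:
                flow = run_graph(graphs, target, flow)
        elif node["type"] == "Echo":
            flow = run_echo(node, flow)
        else:
            raise ValueError("node type not modelled: " + node["type"])
    return flow


graphs = {
    "start": [
        {"type": "Pipe", "pipeTargets": ["ok", "ok2"]},
        {"type": "Echo", "log": "My flow map contains both ok and ok2!"},
    ],
    "ok": [{"type": "Echo", "put": "ok", "value": "hello"}],
    "ok2": [{"type": "Echo", "put": "ok2", "value": "hello"}],
}

result = run_graph(graphs, "start", {})
print(result)  # {'ok': 'hello', 'ok2': 'hello'}
```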
To use flow map content and to make the configuration of nodes dependent on input, templates are used. A templating engine is embedded into Strings and follows a simple grammar.
- {X}: Value of the flow map at key X
- {X}[Y]: List element at index Y of list X. It follows that X has to be a nested template which evaluates to a list, and Y an Integer (or a template that evaluates to an Integer)
- {X@Y}: Map element at key Y of map X. It follows that X has to be a nested template which evaluates to a map, and Y a String (or a template that evaluates to a String)
- {X}{Y}: String concatenation
- helloworld: Static template
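The grammar above can be illustrated with a small recursive-descent evaluator in Python. This is a sketch of the grammar as listed here, not Scraper's template engine: the function names are hypothetical, escaping is not handled, and a template consisting of a single part is assumed to keep the type of its value (so that {okmap} yields the map itself).

```python
def evaluate(template, flow):
    """Evaluate a template string against the flow map `flow`."""
    value, pos = _parse(template, 0, "", flow)
    if pos != len(template):
        raise ValueError("unexpected character at position %d" % pos)
    return value


def _parse(s, i, stops, flow):
    """Parse parts until the end of s or a character contained in `stops`."""
    parts = []
    while i < len(s) and s[i] not in stops:
        if s[i] == "{":
            value, i = _parse_braces(s, i, flow)
            parts.append(value)
        else:
            start = i
            while i < len(s) and s[i] != "{" and s[i] not in stops:
                i += 1
            parts.append(s[start:i])  # static text
    if len(parts) == 1:
        return parts[0], i  # a single part keeps its type (list, map, ...)
    return "".join(str(p) for p in parts), i  # {X}{Y}: concatenation


def _parse_braces(s, i, flow):
    """Parse {X}, {X@Y} or {X}[Y], where s[i] is the opening brace."""
    inner, i = _parse(s, i + 1, "@}", flow)
    if i >= len(s):
        raise ValueError("unclosed '{'")
    if s[i] == "@":  # {X@Y}: map element
        key, i = _parse(s, i + 1, "}", flow)
        if i >= len(s):
            raise ValueError("unclosed '{'")
        if not isinstance(inner, dict):
            raise TypeError("X in {X@Y} must evaluate to a map")
        return inner[key], i + 1
    i += 1  # skip the closing brace
    if i < len(s) and s[i] == "[":  # {X}[Y]: list element
        index, j = _parse(s, i + 1, "]", flow)
        if j >= len(s):
            raise ValueError("unclosed '['")
        if not isinstance(inner, list):
            raise TypeError("X in {X}[Y] must evaluate to a list")
        return inner[int(index)], j + 1
    return flow[inner], i  # {X}: flow map lookup


flow = {"oklist": ["hello", "world"], "okmap": {"name": "smith", "age": "25"}}
print(evaluate("{{oklist}}[0] {{oklist}}[1]", flow))  # hello world
print(evaluate("Is {{okmap}@age} years old", flow))   # Is 25 years old
```

Note how {{oklist}}[0] needs the double braces: the inner {oklist} is the nested template that evaluates to the list, and the outer braces together with [0] select the element.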
name: echoinspect
graphs:
  start:
    - type: Pipe
      pipeTargets: [ok, ok2]
    - type: Echo
      log: # print out a map
        info: "My input contains a list at oklist and a map at okmap"
        # {oklist} resolves to the list, {_}[0] inspects the content
        msg1: "{{oklist}}[0] {{oklist}}[1]"
        msg2: "{okmap}"
        msg3: "Is {{okmap}@age} years old"
  # use JSON embedding to save a bit of space
  ok: [ {type: Echo, put: oklist, value: [hello, world] } ]
  # age: 25 does not work, as it is an integer and not a String
  # using 25 causes the typechecker to throw an error, prohibiting the execution of the workflow
  ok2: [ {type: Echo, put: okmap, value: {name: smith, age: "25"} } ]

For the log configuration of Node we use the fact that it has the type T<?> (which is the same as T<Object> in Scraper).
This means String templates can be inside lists and maps, too.
Every nested template is evaluated.
In contrast, put of Echo has the type T<A>, so only homogeneous lists and maps are allowed by the type checker.
If you need to build complex JSON objects, JsonObject can be used.
It has the same API but the type is T<Object>, therefore the resulting output has less type information and is more dangerous to use.
Assume that earlier in the taskflow you gathered an id, a list of String comments at comments, and a String title at title. To save these as a JSON document to disk, you first have to build the document; JsonObject can be used for this purpose:
name: staticoutput
graphs:
  start:
    # replace these Echos with another taskflow that gathers information
    - {type: Echo, put: id, value: 1}
    - {type: Echo, put: title, value: "hello world"}
    - {type: Echo, put: comments, value: ["lorem", "ipsum"]}
    - type: JsonObject # Using Echo results in a type error!
      put: jsondoc
      value:
        id: "{id}"
        title: "{title}"
        comments: "{comments}" # a list!
        new: true # some static information
    - type: ObjectToJsonString
      object: "{jsondoc}"
      #result: "result"
    - type: WriteLineToFile
      output: "out.json"
      line: "{result}"
      overwrite: true

Nodes used: Echo, JsonObject, ObjectToJsonString, WriteLineToFile
The absolute address schema is taskflow.graph.label or taskflow.graph.index.
Relative address schemas are allowed.
addr as seen from a node can be (checked in this order):
- local node with label addr
- graph id addr
- imported taskflow addr
addr1.addr2 as seen from a node can be (checked in this order):
- node addr2 in graph addr1
- graph addr2 in imported taskflow addr1
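The lookup order above can be sketched in Python. This is a model under stated assumptions, not Scraper's resolver: taskflows are plain dictionaries, imports are keyed by the imported taskflow's name, a "local node" is taken to mean a labelled node in the graph of the addressing node, a graph or taskflow address resolves to its first node (for a taskflow, in its entry graph), and the names `find_label`, `find_node`, and `resolve` are hypothetical.

```python
def find_label(graph, label):
    """Index of the node carrying `label` in graph, or None."""
    for i, node in enumerate(graph):
        if node.get("label") == label:
            return i
    return None


def find_node(graph, ref):
    """Index of the node addressed by label or by numeric index, or None."""
    i = find_label(graph, ref)
    if i is None and ref.isdigit() and int(ref) < len(graph):
        i = int(ref)
    return i


def resolve(addr, flow, current_graph, imports=None):
    """Resolve addr to (taskflow name, graph id, node index)."""
    imports = imports or {}
    graphs = flow["graphs"]
    parts = addr.split(".")
    if len(parts) == 1:
        a = parts[0]
        i = find_label(graphs[current_graph], a)  # 1. local node with label
        if i is not None:
            return (flow["name"], current_graph, i)
        if a in graphs:                           # 2. graph id
            return (flow["name"], a, 0)
        if a in imports:                          # 3. imported taskflow
            child = imports[a]
            return (child["name"], child.get("entry", "start"), 0)
    elif len(parts) == 2:
        a1, a2 = parts
        if a1 in graphs:                          # 1. node a2 in graph a1
            i = find_node(graphs[a1], a2)
            if i is not None:
                return (flow["name"], a1, i)
        if a1 in imports and a2 in imports[a1]["graphs"]:
            return (imports[a1]["name"], a2, 0)   # 2. graph a2 in import a1
    elif len(parts) == 3:                         # absolute address
        t, g, n = parts
        target = flow if t == flow["name"] else imports.get(t)
        if target is not None and g in target["graphs"]:
            i = find_node(target["graphs"][g], n)
            if i is not None:
                return (target["name"], g, i)
    raise KeyError("cannot resolve address: " + addr)


myflow = {
    "name": "myflow",
    "graphs": {
        "start": [{"type": "Pipe"}, {"type": "Echo"},
                  {"type": "Echo", "label": "localnode"}],
        "localgraph": [{"type": "Echo"}],
        "graph": [{"type": "Echo"}, {"type": "Echo", "label": "nodeingraph"}],
    },
}

print(resolve("localnode", myflow, "start"))          # ('myflow', 'start', 2)
print(resolve("graph.nodeingraph", myflow, "start"))  # ('myflow', 'graph', 1)
```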
name: myflow
graphs:
  start:
    - type: Pipe
      pipeTargets: ["localnode", "localgraph", "graph.nodeingraph", "myflow.graph.0"]
    - { type: Echo, log: "Finished!", forward: false }
    - { type: Echo, log: "I'm addressed directly", label: localnode }
  localgraph:
    - { type: Echo, log: "localgraph!" }
  graph:
    - { type: Echo, log: "first node in graph!" }
    - { type: Echo, log: "I'm accessed twice!", label: nodeingraph }

Addressing nodes inside graphs could make the resulting control flow less understandable.

Importing is used to modularize the taskflow. Addressing only works from parent to child: nodes in a child taskflow cannot address nodes in a parent taskflow.
The design of importing taskflows is under discussion and feedback is welcome.
Currently, imports is a map where the key specifies the path of the taskflow to import and the value is unused.
myflow.yf:
name: myflow
imports:
  import.yf:
graphs:
  start:
    - { type: Echo, log: hello, goTo: importedflow.gohere }

import.yf:
name: importedflow
graphs:
  gohere:
    - { type: Echo, log: "I'm here now" }

When using more than one specification, the main specification has to be supplied as a command line argument, e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest myflow.yf.

The core framework is able to generate flow graphs to visualize the specification via the nodes and arrows model.
root.yf:
name: root
imports:
  child.yf:
graphs:
  start:
    - { type: Echo, log: "Starting taskflow"}
    - { type: Pipe, pipeTargets: [arg1, root.arg2.0] }
    - { type: Echo, log: "Finished: {arg1} {arg2}!"}
  arg1:
    - { type: Echo, put: elements1, value: ["1","2","4","5","6"] }
    - { type: Map, list: "{elements1}", mapTarget: child.api, putElement: a }
    - { type: Echo, put: arg1, value: hello}
  arg2:
    - { type: Echo }
    - { type: Echo, put: elements2, value: ["wo", "rl", "d"] }
    - { type: Map, list: "{elements2}", mapTarget: child.api, putElement: a }
    - { type: Echo, put: arg2, value: "{{elements2}}[0]" }

child.yf:
name: child
graphs:
  # API
  # a :: String
  # static checks ensure that this specification is only executed if the importers provide a String at location `a`
  api:
    - { type: Echo, log: "Got input {a}"}

Executing cfg will yield the flow graph for this specification (e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest root.yf cfg exit).

Currently, crossed arrows are depicted as red arrows in the flow graph generator.
Global node configurations can be specified by node type or by a regular expression matched against node types.
A regular expression has to be surrounded by slashes (/.../).
name: myflow
globalNodeConfigurations:
  Echo:
    log: "Global logging"
  "/Pip.*/":
    log: "My target is tar"
    pipeTargets: [tar]
graphs:
  start:
    - { type: Echo, log: "hello"}
    - { type: Pipe }
  tar:
    - { type: Echo }
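How the global and local configurations of this example combine can be modelled in a few lines of Python. This is a sketch, not Scraper's implementation: the function `effective_config` is hypothetical, a regular expression is assumed to have to match the whole node type, and when several global entries match, they are assumed to apply in declaration order. The only rule taken from the text is that local node configuration has precedence over global node configuration.

```python
import re


def effective_config(node, global_configs):
    """Merge the matching global configurations with the node's own keys."""
    merged = {}
    for key, config in global_configs.items():
        if len(key) >= 2 and key.startswith("/") and key.endswith("/"):
            # Keys of the form /regex/ are matched against the node type.
            matches = re.fullmatch(key[1:-1], node["type"]) is not None
        else:
            # All other keys name a node type exactly.
            matches = key == node["type"]
        if matches:
            merged.update(config)
    merged.update(node)  # local configuration wins over global configuration
    return merged


global_configs = {
    "Echo": {"log": "Global logging"},
    "/Pip.*/": {"log": "My target is tar", "pipeTargets": ["tar"]},
}

print(effective_config({"type": "Echo", "log": "hello"}, global_configs))
# {'log': 'hello', 'type': 'Echo'}
print(effective_config({"type": "Pipe"}, global_configs))
# {'log': 'My target is tar', 'pipeTargets': ['tar'], 'type': 'Pipe'}
```

Under these assumptions, the first Echo in start keeps its own log value, the Pipe node receives both its log message and its pipeTargets from the regex entry, and the Echo in tar logs "Global logging".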