{"id":248,"date":"2015-05-29T18:27:19","date_gmt":"2015-05-29T18:27:19","guid":{"rendered":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=248"},"modified":"2015-09-13T09:07:45","modified_gmt":"2015-09-13T09:07:45","slug":"reading-variable-length-event-sources","status":"publish","type":"post","link":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=248","title":{"rendered":"Reading variable-length event sources"},"content":{"rendered":"<p>Some data sources, such as <code>mod_udp<\/code>, only know how to fetch data in fixed-size blocks. The modules that take those data and dissect them into OrchIDS events, on the other hand, may require a variable number of bytes. \u00a0The <code>mod_utils.[ch]<\/code> files provide a general-purpose API to solve this impedance mismatch problem: the <code>blox<\/code> API.<!--more--><\/p>\n<p>There are three\u00a0simple ways the length can be specified, in principle:<\/p>\n<ul>\n<li>implicitly: all required data blocks have exactly the same size (no event record format known to OrchIDS seems to work this way);<\/li>\n<li>explicitly: the first few bytes, say, contain the number of bytes to be read (surprisingly, no event record format known to OrchIDS does something as simple as that);<\/li>\n<li>by end-of-record\u00a0character: every byte up to the terminator character is taken to form the data block (e.g., the <code>mod_bintotext<\/code>\u00a0module works this way, considering the newline character <code>'\\n'<\/code> as terminator).<\/li>\n<\/ul>\n<p>The\u00a0<code>blox<\/code>\u00a0API was initially\u00a0meant to solve the problem in the context of the <code>mod_openbsm<\/code> module, for which finding the length is a bit more complicated.
However, it is suited to solve the length problem in all three cases above as well, as we shall partly see at the end of this post.<\/p>\n<h3>Automata<\/h3>\n<p>Let us explain the\u00a0<code>mod_openbsm<\/code> case in more detail:\u00a0the first byte is a type tag, and depending on that type, we find the length in different ways. In the first 4 cases, the next 4 bytes hold the length in big-endian format (<em>including<\/em> the 5 bytes already read). In a fifth case, the next 8 bytes are a time value, and the following 2 bytes hold the length of the subsequent file name (<em>excluding<\/em> the 11 bytes already read) in big-endian format.<\/p>\n<p>Reading the length, and in fact the whole data block, can be described by the following automaton:<\/p>\n<ul>\n<li>There are five states:\u00a0<code>BLOX_INIT<\/code>,\u00a0<code>BLOX_NOT_ALIGNED<\/code>, <code>BLOX_FINAL<\/code> (those three are all predefined in the <code>blox<\/code> library, as numbers 0, 1, 2 respectively),\u00a0<code>STATE_HEADER<\/code>, and\u00a0<code>STATE_FILE_EXPECT_FILENAMELEN<\/code>\u00a0(the last two are defined in\u00a0<code>mod_openbsm<\/code>).<\/li>\n<li>The initial state is\u00a0<code>BLOX_INIT<\/code>. In that state, we have read the first byte of the data block.<\/li>\n<li>When in state\u00a0<code>BLOX_INIT<\/code>, we look at the first byte. There are 5 legal values for this byte. \u00a0In the first 4 cases, we go to state <code>STATE_HEADER<\/code>, and request to read 5 bytes (that is, 4 extra bytes: we have already read 1 byte). In the fifth case, we go to state <code>STATE_FILE_EXPECT_FILENAMELEN<\/code>, and request to read 11 bytes. If the byte read does not match any of the previous cases, we go to state <code>BLOX_NOT_ALIGNED<\/code>, requesting to resynchronize the data: throw away whatever we have read, re-read one byte and go back to state <code>BLOX_INIT<\/code>.
\u00a0(Resynchronizing is done automatically by the\u00a0<code>blox<\/code>\u00a0engine, and is implemented in the provided function\u00a0<code>blox_dissect()<\/code>. \u00a0However, you must describe the other actions.)<\/li>\n<li>When in state\u00a0<code>STATE_HEADER<\/code>, we have read 5 bytes. \u00a0We interpret the last 4 bytes as a length\u00a0<em>n<\/em>: we request to read <em>n<\/em>\u00a0bytes (including the 5 bytes we have already read), and go to state\u00a0<code>BLOX_FINAL<\/code>. We have then finished our task: the <code>blox<\/code> engine will make sure that we have read <em>n<\/em>\u00a0bytes, and pass them on (to the\u00a0<i>subdissector<\/i>, see below).<\/li>\n<li>When in state\u00a0<code>STATE_FILE_EXPECT_FILENAMELEN<\/code>, we have read 11 bytes. \u00a0We interpret the last 2 bytes as a length\u00a0<em>m<\/em>: we request to read\u00a0<em>m<\/em>\u00a0more bytes (excluding the 11 bytes we have already read, so we request to read\u00a0<em>m<\/em>+11 bytes in total), and go to state\u00a0<code>BLOX_FINAL<\/code>, again.<\/li>\n<\/ul>\n<h3>The blox API<\/h3>\n<p>This automaton is described by a function of the following type, which you must provide (in the case of <code>mod_openbsm<\/code>, this function is called <code>openbsm_compute_length<\/code>):<\/p>\n<pre>typedef size_t (*compute_length_fun) (unsigned char *first_bytes,\r\n                                      size_t n_first_bytes,\r\n                                      size_t available_bytes,\r\n                                      int *state, \/* pointer to blox state *\/\r\n                                      void *sd_data);<\/pre>\n<p>When your function of that type is called, <code>first_bytes<\/code> will point to a memory area that holds <code>available_bytes<\/code> bytes; this number is always larger than or equal to the previously requested number of bytes, <code>n_first_bytes<\/code>.
(If you request 11 bytes, the <code>mod_udp<\/code> module may decide to read 1024 bytes instead, for example.) The integer pointed to by <code>state<\/code> is the current state. Your function&#8217;s task will be to update the state by storing the new state into <code>*state<\/code>, and return the requested number of bytes to read (5, or 11, for example, in the <code>mod_openbsm<\/code> case). The pointer <code>sd_data<\/code> points to private data that is passed to each invocation of your function: do whatever you please with it.<\/p>\n<p>Once we have reached the\u00a0<code>BLOX_FINAL<\/code> state (by storing it into\u00a0<code>*state<\/code>), the\u00a0<code>blox<\/code>\u00a0engine will call a\u00a0<i>subdissector<\/i> function, which you must provide, too, and which has the following type (in the\u00a0<code>mod_openbsm<\/code>\u00a0case, this is <code>openbsm_subdissect()<\/code>):<\/p>\n<pre>typedef void (*subdissect_fun) (orchids_t *ctx, mod_entry_t *mod,\r\n                                event_t *event,\r\n                                ovm_var_t *delegate,\r\n                                unsigned char *stream,\r\n                                size_t stream_len,\r\n                                void *sd_data,\r\n                                int dissector_level);<\/pre>\n<p>When your subdissector is called, <code>stream<\/code> will point to <code>stream_len<\/code> bytes holding a complete data block. You must now chop this block into pieces, enriching the <a title=\"Converting input into events\" href=\"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=267\">event<\/a> (list of field\/value pairs) <code>event<\/code>.
This works just like an ordinary dissector.<\/p>\n<p>The <code>mod<\/code>\u00a0value points to the current module (<code>mod_openbsm<\/code> in our example), and <code>sd_data<\/code> is the same pointer to private data that we mentioned above.<\/p>\n<p>The <code>delegate<\/code> value is a bit more mysterious. \u00a0In the\u00a0<code>mod_openbsm<\/code>\u00a0example again, the data block will be part of an OrchIDS binary string <code>str<\/code> (or a virtual binary string). \u00a0Some of the field\/value pairs will include substrings of it. \u00a0It is useful to create those substrings as\u00a0<em>virtual<\/em> strings. \u00a0For that, the <a title=\"virtual strings\" href=\"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=53\"><code>ovm_vstr_new()<\/code><\/a> and <a title=\"virtual binary strings\" href=\"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=43\"><code>ovm_vbstr_new()<\/code><\/a> functions require a delegate: this is the <code>delegate<\/code> value. Most often, <code>delegate<\/code> will be the string <code>str<\/code>. However, if <code>str<\/code> is itself virtual, <code>delegate<\/code> might be the delegate of <code>str<\/code> instead.<\/p>\n<p>Finally, the\u00a0<code>dissector_level<\/code>\u00a0value holds the number of nested dissectors called so far on the current data source. \u00a0You don&#8217;t need to know anything about it, except that you should pass it along to calls to further subdissectors, and to <a title=\"Converting input into events\" href=\"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=267\"><code>post_event()<\/code><\/a>\u00a0and\u00a0<a title=\"Converting input into events\" href=\"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=267\"><code>REGISTER_EVENTS()<\/code><\/a>\u00a0(which themselves\u00a0may call subdissectors, and will do so with a value of <code>dissector_level+1<\/code>).
\u00a0The\u00a0<code>blox<\/code>\u00a0API uses it to register itself\u00a0into\u00a0<a title=\"The event dispatcher\" href=\"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/?p=212\">the\u00a0<code>rtactionlist<\/code>\u00a0priority queue<\/a>, with a priority equal to\u00a0<code>dissector_level*128<\/code>. \u00a0This ensures that <code>blox<\/code>\u00a0dissectors consume their input faster than this input is produced.<\/p>\n<p>Only two small tasks remain. \u00a0We must write our dissector: this will just be a simple call to the following function, provided by\u00a0the\u00a0<code>blox<\/code>\u00a0API:<\/p>\n<pre>int blox_dissect(orchids_t *ctx, mod_entry_t *mod, event_t *event,\r\n                 void *sd_data, int dissector_level);<\/pre>\n<p>For example, the dissector of the\u00a0<code>mod_openbsm<\/code>\u00a0module is:<\/p>\n<pre>static int openbsm_dissect (orchids_t *ctx, mod_entry_t *mod,\r\n                            event_t *event, void *data, int dissector_level)\r\n{\r\n  return blox_dissect (ctx, mod, event, data, dissector_level);\r\n}<\/pre>\n<p>And we must also make sure that we have initialized an instance of the\u00a0<code>blox<\/code>\u00a0API for each possible matching\u00a0source, using the following function:<\/p>\n<pre>blox_hook_t *init_blox_hook(orchids_t *ctx,\r\n                            blox_config_t *bcfg,\r\n                            char *tag,\r\n                            size_t taglen);<\/pre>\n<p>This returns a pointer to a\u00a0<code>blox_hook_t<\/code>\u00a0structure, which holds various buffers and flags. \u00a0Each pair of a source and a blox dissector should have its own\u00a0<code>blox_hook_t<\/code>\u00a0structure. \u00a0It is therefore natural to call <code>init_blox_hook()<\/code>\u00a0for each\u00a0<code>DISSECT<\/code>\u00a0directive. \u00a0This is done by installing a\u00a0<em>pre-dissection<\/em> hook in the\u00a0<code>input_module_t<\/code>\u00a0structure describing the module we are creating. 
\u00a0For example,\u00a0the pre-dissector of the\u00a0<code>mod_openbsm<\/code>\u00a0module is:<\/p>\n<pre>static void *openbsm_predissect(orchids_t *ctx, mod_entry_t *mod,\r\n                                char *parent_modname,\r\n                                char *cond_param_str,\r\n                                int cond_param_size)\r\n{\r\n  blox_hook_t *hook;\r\n\r\n  hook = init_blox_hook (ctx, mod-&gt;config, cond_param_str, cond_param_size);\r\n  return hook;\r\n}<\/pre>\n<p>The OrchIDS engine will make sure that the hook returned by the pre-dissector is passed on to the <code>blox<\/code>\u00a0API, so that it knows which buffers and flags pertain to which input\/dissector pair.<\/p>\n<p>Finally, the\u00a0<code>bcfg<\/code>\u00a0argument to\u00a0<code>init_blox_hook()<\/code>\u00a0holds configuration information for the whole dissector module (not for each one of its instances). \u00a0You obtain it by calling:<\/p>\n<pre>blox_config_t *init_blox_config(orchids_t *ctx,\r\n                                mod_entry_t *mod,\r\n                                size_t n_first_bytes,\r\n                                compute_length_fun compute_length,\r\n                                subdissect_fun subdissect,\r\n                                void *sd_data\r\n                                );<\/pre>\n<p>Here, <code>n_first_bytes<\/code> is the number of bytes that should be read each time <code>blox_dissect()<\/code> is called. In the case of the <code>mod_openbsm<\/code>\u00a0module, we only need to read one byte. For other formats, we may need to read 4 bytes holding a length, for example.<\/p>\n<p>The <code>compute_length<\/code> and <code>subdissect<\/code> function arguments are those we have described above, and this is how we inform the <code>blox<\/code>\u00a0engine what those functions are.
\u00a0Finally, <code>sd_data<\/code> is the private pointer that will be passed to both.<\/p>\n<p>Again in the case of the <code>mod_openbsm<\/code> module, this initialization is done in the preconfiguration function below (the call to <code>register_fields()<\/code> is meant to register all fields known to <code>mod_openbsm<\/code>, and is\u00a0not directly relevant to this post):<\/p>\n<pre>static void *openbsm_preconfig(orchids_t *ctx, mod_entry_t *mod)\r\n{\r\n  blox_config_t *bcfg;\r\n\r\n  register_fields(ctx, mod, openbsm_fields, OPENBSM_FIELDS);\r\n  bcfg = init_blox_config (ctx, mod, 1,\r\n                           openbsm_compute_length,\r\n                           openbsm_subdissect,\r\n                           NULL);\r\n  return bcfg;\r\n}<\/pre>\n<p>Returning\u00a0<code>bcfg<\/code>\u00a0makes sure it will be stored into the <code>config<\/code> field of the module <code>mod<\/code>: we retrieve it as <code>mod-&gt;config<\/code> in the call we have made above to <code>init_blox_hook()<\/code>.<\/p>\n<h3>Other uses of the blox API<\/h3>\n<p>We have said that\u00a0the\u00a0<code>blox<\/code>\u00a0API could be used for more general purposes. \u00a0Let us give the example of the <code>mod_bintotext<\/code> module, which converts blocks of binary data into sequences of lines terminated by the newline character <code>\\n<\/code>.<\/p>\n<p>In state <code>BLOX_INIT<\/code>, the <code>bintotext_compute_length()<\/code> function looks for a newline character <code>\\n<\/code> inside the <code>first_bytes<\/code> array, of length <code>available_bytes<\/code>. (We reuse the same argument\u00a0names as in the length computing function of the <code>mod_openbsm<\/code> module.)
Note that we only request 1 byte, just as in the <code>mod_openbsm<\/code> case, but <code>available_bytes<\/code> may be much larger: typically 1024 bytes will be available for binary data coming from a binary file or a UDP socket.<\/p>\n<p>If the newline character was found, then <code>bintotext_compute_length()<\/code> goes to the <code>BLOX_FINAL<\/code> state, and returns the offset of the first character past the newline character. \u00a0This provides the subdissector with the first line of text, including the final newline.<\/p>\n<p>If the newline character was not found, then <code>bintotext_compute_length()<\/code> goes to a new state, <code>BLOX_NSTATES + available_bytes<\/code>. This unusual state number is guaranteed to be larger than or equal to <code>BLOX_NSTATES<\/code>, the number of reserved states (<code>BLOX_INIT<\/code>, <code>BLOX_FINAL<\/code>, and <code>BLOX_NOT_ALIGNED<\/code>). Adding <code>available_bytes<\/code>\u00a0to <code>BLOX_NSTATES<\/code>\u00a0allows us to remember how many bytes we have already scanned without finding a newline. \u00a0(We could also have used the <code>sd_data<\/code>\u00a0pointer for that purpose.) \u00a0The <code>bintotext_compute_length()<\/code>\u00a0function then requests just one byte more, i.e., returns <code>available_bytes+1<\/code>.<\/p>\n<p>When control is returned to <code>bintotext_compute_length()<\/code> in some state other than <code>BLOX_INIT<\/code>, <code>BLOX_FINAL<\/code>, and <code>BLOX_NOT_ALIGNED<\/code> (say, <code>BLOX_NSTATES + 1024<\/code>), with a new value of <code>available_bytes<\/code> (say, 2048), we look for a newline in the as yet unexplored part of the character array <code>first_bytes<\/code> (i.e., between offsets 1024 inclusive and 2048 exclusive, in our example), and we proceed as above.<\/p>\n<p>When <code>BLOX_FINAL<\/code> is reached, the <code>blox<\/code>\u00a0API will then call our subdissector.
\u00a0This merely takes the first <code>stream_len<\/code> bytes of the <code>stream<\/code> character array and makes them into a virtual string, associated with the <code>.bintotext.line<\/code> field. (We use the same variable names as in the <code>mod_openbsm<\/code> example; note that we do not subtract 1 from <code>stream_len<\/code>, so that the trailing newline is kept.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Some data sources, such as mod_udp, only know how to fetch data in fixed-size blocks. The modules that take those data and dissect them into OrchIDS events, on the other hand, may require a variable number of bytes. \u00a0The mod_utils.[ch] files provide a general-purpose API to solve this impedance mismatch problem: the blox API.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,8],"tags":[],"class_list":["post-248","post","type-post","status-publish","format-standard","hentry","category-dissection","category-event-management"],"_links":{"self":[{"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=\/wp\/v2\/posts\/248","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=248"}],"version-history":[{"count":19,"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=\/wp\/v2\/posts\/248\/revisions"}],"predecessor-version":[{"id":292,"href":"https:\/\/projects.l
sv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=\/wp\/v2\/posts\/248\/revisions\/292"}],"wp:attachment":[{"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=248"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=248"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/projects.lsv.ens-paris-saclay.fr\/orchidsdev\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=248"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}