PHP - preg_split regex lookback for multiple matches
The goal of this regex is to split on Unicode whitespace, excluding newline, and to ensure that each newline character stays appended to the previous non-whitespace token. I'm seeing this work, but only for single whitespace characters before \n.
Using the current regex:
$data = "the\nquick\n brown fox jumped \nover lazy dog.";
$tokenized = preg_split("~(?<=\n)|\p{Z}+(?!\n)~u", $data, -1, PREG_SPLIT_OFFSET_CAPTURE);
Current result (I have added \n where the "\n" character is present):
array (
  [0] => array ( [0] => the\n [1] => 0 )
  [1] => array ( [0] => quick\n [1] => 4 )
  [2] => array ( [0] => [1] => 10 )
  [3] => array ( [0] => brown [1] => 11 )
  [4] => array ( [0] => fox [1] => 17 )
  [5] => array ( [0] => jumped [1] => 21 )
  [6] => array ( [0] => \n [1] => 31 )
  [7] => array ( [0] => on [1] => 33 )
  [8] => array ( [0] => [1] => 38 )
  [9] => array ( [0] => lazy [1] => 42 )
  [10] => array ( [0] => dog. [1] => 47 )
)
Expected result:
array (
  [0] => array ( [0] => the\n [1] => 0 )
  [1] => array ( [0] => quick\n [1] => 4 )
  [2] => array ( [0] => brown [1] => 10 )
  [3] => array ( [0] => fox [1] => 16 )
  [4] => array ( [0] => jumped\n [1] => 20 )
  [5] => array ( [0] => on [1] => 27 )
  [6] => array ( [0] => [1] => 32 )
  [7] => array ( [0] => lazy [1] => 36 )
  [8] => array ( [0] => dog. [1] => 41 )
)
Any advice is appreciated. Thanks.
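
One possible explanation for the single-whitespace behaviour is backtracking. With two or more whitespace characters before \n, the greedy \p{Z}+ can give back one character so that the (?!\n) lookahead succeeds earlier in the run. A minimal sketch of this, assuming a possessive quantifier (\p{Z}++) is an acceptable change; the short input string and the variable names are made up for illustration and are not from the original data:

```php
<?php
// Hypothetical input: "a", two spaces, newline, "b", one space, "c".
$sample = "a  \nb c";

// Original pattern: greedy \p{Z}+ backtracks to a single space, so the
// lookahead (?!\n) passes and the run before the newline gets split.
$greedy = preg_split('~(?<=\n)|\p{Z}+(?!\n)~u', $sample);

// Possessive \p{Z}++ never gives characters back, so a whitespace run that
// ends right before "\n" is not treated as a delimiter at all.
$possessive = preg_split('~(?<=\n)|\p{Z}++(?!\n)~u', $sample);

var_export($greedy);      // "a", " \n", "b", "c"
echo PHP_EOL;
var_export($possessive);  // "a  \n", "b", "c"
echo PHP_EOL;
```

In this sketch the whitespace before the newline stays inside the token ("a  \n"), so it keeps the newline attached to the preceding word but does not strip the spaces. The same patterns can be passed with PREG_SPLIT_OFFSET_CAPTURE if offsets are needed.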