Perl - split command with regex - split numeric values and strings
My data is as follows:
20110627 abc dbe efg 217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
I am trying to split it in such a way that each numeric value ends up in its own cell and each run of strings ends up in one cell.
Thus, I want "20110627" in one cell, "abc dbe efg" in another, "0.811585416264778" in another, "-0.157081048309312" in another, etc.
I have the following split command in Perl, using a regex:
my @fld = split(/[\d+][\s][\w+]/, $_);
But it doesn't seem to do what I want. Can anyone tell me which regex to use? Thanks in advance.
Edit: Following vks's suggestion, I changed the regex a little bit to get rid of whitespace and to take into account that a string might contain commas (,), slashes (/), or dashes (-). The negative sign (-) seems to be taken as a separate token in numbers:
(-?\d+(\.\d+)?)|([\/?,?\.?\-?a-zA-Z\/ ]+)

20110627 b c 217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312
19950725 c 16458 63 91 0.38279256288735 0.552922590837283 -0.170130027949933
19980323 g c /de/ 20130516 - e, inc. 33019 398 197 1.205366607105 0.596626184923832 0.608740422181168
20130516 - e, inc. 24094 134 137 0.556155059350876 0.56860629202291 -0.0124512326720345
19960327 f c /de 38905 503 169 1.29289294435163 0.434391466392495 0.858501477959131
Expected output:
20110627 in one token
b c in one token
-0.170130027949933 in one token
g c /de/ in one token
- e, inc. in one token
(Of course, the other values should be in separate tokens as well. In other words, each string goes in one token and each number goes in one token. I cannot write out every single one of them, but I think it is straightforward.)
2nd edit:
Brian found the right regex: /(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/ (see below). Thanks, Brian! I have a follow-up question: I am writing the results of the regex split to an Excel file, using the following code:
    use warnings;
    use strict;
    use Spreadsheet::WriteExcel;
    use Scalar::Util qw(looks_like_number);
    use Spreadsheet::ParseExcel;
    use Spreadsheet::ParseExcel::SaveParser;
    use Spreadsheet::ParseExcel::Workbook;

    if (($#ARGV < 1) || ($#ARGV > 2)) {
        die("Usage: tab2xls tabfile.txt newfile.xls\n");
    }

    open(TABFILE, $ARGV[0]) or die "$ARGV[0]: $!";

    my $workbook  = Spreadsheet::WriteExcel->new($ARGV[1]);
    my $worksheet = $workbook->add_worksheet();
    my $row = 0;
    my $col = 0;

    while (<TABFILE>) {
        chomp;
        # Split the line using the token regex as the separator pattern
        my @fld = split(/(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/, $_);
        $col = 0;
        foreach my $token (@fld) {
            $worksheet->write($row, $col, $token);
            $col++;
        }
        $row++;
    }
The problem is that I get empty cells when I use this code:
> "empty cell" "1000" "empty cell" "empty cell" "abc deg" "empty cell"
> "2500" "empty cell" "empty cell" "1500" "3500"
Why am I getting these empty cells? Is there a way to avoid that? Thanks a lot.
Using your revised requirements to allow "/", ",", "-", etc., here's a regex that captures numeric tokens in capture group #1 and alpha tokens in capture group #2:
(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)
(see regex101 example)
Breakdown:
(-?\d+(?:\.\d+)?)
(capture group #1) matches numbers, with a possible negative sign and possible decimal places (in a non-capturing group)
([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)
(capture group #2) matches alpha strings with possible embedded whitespace
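On the empty cells in the follow-up question: when the pattern given to split contains capture groups, split returns the captured text as extra fields alongside the text between separators, and a group that did not take part in a match comes back as undef. That is a likely source of the empty cells. A minimal sketch of one way around it, assuming the tokens themselves are what you want, is to match globally instead of splitting. The helper name tokenize is hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Pattern from the answer: group 1 = number, group 2 = run of text.
my $token_re = qr/(-?\d+(?:\.\d+)?)|([\/,.\-a-zA-Z]+(?:\s+[\/,.\-a-zA-Z]+)*)/;

# tokenize() is a hypothetical helper name used only for this sketch.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    # Walk the line match by match instead of using split.
    while ($line =~ /$token_re/g) {
        # Only one of the two groups is set for any given match.
        push @tokens, defined $1 ? $1 : $2;
    }
    return @tokens;
}

my @fld = tokenize('20110627 abc dbe efg 217722 1425 1767 0.654504367955466 0.811585416264778 -0.157081048309312');
print join('|', @fld), "\n";
```

In the spreadsheet loop, the same idea would mean replacing the split call with a call to this helper, so that every element of @fld is a real token and no undef or whitespace-only field is written to a cell.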