r - Error in read.table: duplicate row.names
When I tried to read the following table into a data frame (data100) with:

data100 <- read.table(header=TRUE, text='
                                 verb_object session_id
1:   ba31c1cc63e5043483fae25f085e25e5 insert   41595370
2: bece6374d91d47e6285efdeba6d65bb9 database   41595371
3:   26d695c8ca82caffdf985201f3aa44d7 update   41595282
4:   26d695c8ca82caffdf985201f3aa44d7 update   41595282
5: 2bc5a4199a0dda16fa17a9ca1aa17c02 database   41595373
6:   6d944d54c54ed75d487288fe1505bb59 insert   41595368
')

I got the following error:

Error in read.table(header = TRUE, text = "\n verb_object session_id\n ba31c1cc63e5043483fae25f085e25e5 insert 41595370\n bece6374d91d47e6285efdeba6d65bb9 database 41595371\n 26d695c8ca82caffdf985201f3aa44d7 update 41595282\n 26d695c8ca82caffdf985201f3aa44d7 update 41595282\n 2bc5a4199a0dda16fa17a9ca1aa17c02 database 41595373\n 6d944d54c54ed75d487288fe1505bb59 insert 41595368\n") :
  duplicate 'row.names' not allowed

How can I read it?
After using

lines <- readLines(textConnection(" verb_object session_id
> data100 <- read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='', fill=TRUE)

the result was as follows:

> data100
  V1 V2 V3 V4 V5 V6 V7
1 verb_object session_id NA NA
2 1: ba31c1cc63e5043483fae25f085e25e5 insert 41595370 2: bece6374d91d47e6285efdeba6d65bb9 database 41595371
3 3: 26d695c8ca82caffdf985201f3aa44d7 update 41595282 4: 26d695c8ca82caffdf985201f3aa44d7 update 41595282
4 5: 2bc5a4199a0dda16fa17a9ca1aa17c02 database 41595373 6: 6d944d54c54ed75d487288fe1505bb59 insert 41595368
>
We can read the text with readLines, place quotes around the 'verb_object' values with gsub, and then read the result with read.table.

lines <- readLines(textConnection("                                 verb_object session_id
1:   ba31c1cc63e5043483fae25f085e25e5 insert   41595370
2: bece6374d91d47e6285efdeba6d65bb9 database   41595371
3:   26d695c8ca82caffdf985201f3aa44d7 update   41595282
4:   26d695c8ca82caffdf985201f3aa44d7 update   41595282
5: 2bc5a4199a0dda16fa17a9ca1aa17c02 database   41595373
6:   6d944d54c54ed75d487288fe1505bb59 insert   41595368"))

read.table(text=gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", lines, perl=TRUE), sep='')
# verb_object session_id
#1: ba31c1cc63e5043483fae25f085e25e5 insert 41595370
#2: bece6374d91d47e6285efdeba6d65bb9 database 41595371
#3: 26d695c8ca82caffdf985201f3aa44d7 update 41595282
#4: 26d695c8ca82caffdf985201f3aa44d7 update 41595282
#5: 2bc5a4199a0dda16fa17a9ca1aa17c02 database 41595373
#6: 6d944d54c54ed75d487288fe1505bb59 insert 41595368
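To see where the quotes land, here is a minimal sketch that runs the same gsub call on a single row. The spacing of the sample row is an assumption (columns aligned the way data.table prints them); the second half of the pattern only matches when at least two whitespace characters precede the session id. Once the hash and the verb are quoted as one field, each data row has one more field than the header, so read.table uses the unique "1:", "2:", ... labels as row names.

```r
# One data row, spaced the way data.table prints it (an assumption: the
# pattern needs at least two spaces in front of the session id).
line <- "1:   ba31c1cc63e5043483fae25f085e25e5 insert   41595370"

# Same pattern and replacement as in the answer above.
quoted <- gsub('(?<=\\:)\\s+|\\s+(?=\\s[0-9])', " '", line, perl = TRUE)
print(quoted)
# "1: 'ba31c1cc63e5043483fae25f085e25e5 insert ' 41595370"

# A single line without a header: read.table names the columns V1, V2, V3.
parsed <- read.table(text = quoted, sep = "", stringsAsFactors = FALSE)
str(parsed)
```

The quoted field keeps the space in front of the closing quote, so trimws() may be useful if exact string comparisons are needed later.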
Update

The OP's new dataset can be read with readLines as before,

lines <- readLines(textConnection("items newitem
1: ba31c1cc63e5043483fae25f085e25e5 insert OV1
2: bece6374d91d47e6285efdeba6d65bb9 database OV2
3: 26d695c8ca82caffdf985201f3aa44d7 update OV3
4: 2bc5a4199a0dda16fa17a9ca1aa17c02 database OV4
5: 6d944d54c54ed75d487288fe1505bb59 insert OV5"))
We should note that the pattern matched in the earlier dataset (\\s+(?=\\s[0-9])) won't work here, because the first character in 'session_id' is a number, while in 'newitem' it is an uppercase letter. So, we match one or more characters that are not a : from the beginning of the string (^[^:]+), followed by a :, followed by one or more spaces (\\s+). We then capture the next characters as a group using parentheses, i.e. one or more characters that are not a space, followed by one or more spaces, followed by characters that are not a space (([^ ]+\\s+[^ ]+)). After that we match one or more spaces (\\s+), followed by all remaining characters up to the end of the string as a second capture group ((.*)$). The replacement places quotes around the first capture group ('\\1'), followed by a space, followed by the second capture group (\\2). The header line contains no :, so it does not match and is left unchanged.
read.table(text=gsub("^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$", "'\\1' \\2", lines), header=TRUE)
# items newitem
#1 ba31c1cc63e5043483fae25f085e25e5 insert OV1
#2 bece6374d91d47e6285efdeba6d65bb9 database OV2
#3 26d695c8ca82caffdf985201f3aa44d7 update OV3
#4 2bc5a4199a0dda16fa17a9ca1aa17c02 database OV4
#5 6d944d54c54ed75d487288fe1505bb59 insert OV5
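The two capture groups can be checked in isolation before reading the whole dataset. The sketch below applies the same pattern to one sample row and to the header line; the variable names (row, header, group1, group2) are only illustrative, and the "OV1" value assumes the uppercase form described in the text.

```r
pattern <- "^[^:]+:\\s+([^ ]+\\s+[^ ]+)\\s+(.*)$"

# Sample inputs: one data row and the header line.
row    <- "1: ba31c1cc63e5043483fae25f085e25e5 insert OV1"
header <- "items newitem"

# Pull out each capture group on its own.
group1 <- sub(pattern, "\\1", row)   # hash and verb, joined by a space
group2 <- sub(pattern, "\\2", row)   # the remaining field

# The replacement used in the answer: quote group 1, keep group 2.
rewritten  <- gsub(pattern, "'\\1' \\2", row)
header_out <- gsub(pattern, "'\\1' \\2", header)  # no ':' so no match

# Header and data row now both have two fields.
parsed <- read.table(text = c(header_out, rewritten), header = TRUE,
                     stringsAsFactors = FALSE)
str(parsed)
```

Because the row label and the colon are consumed by the match and not put back, the result gets default row names (1, 2, ...) instead of "1:", "2:", ... as in the first dataset.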