Tuesday, March 20, 2012

Problem with " when Importing large text file

Hi,
I have a large text file that Excel won't load up completely. The file is
CSV but has extra " embedded in a string that causes error when I import it.
Is there a way to eliminate these embedded " with some tool? Below is a
sample of my importing text:
"ExtBillID","extTID","special_request","trip_ticke t","extiid","invoice","extschedid","cycle","formid ","form","formdate","standard_message","message"," extsid","source","who_printed","date_printed","Pri nted","cancelled","why_cancelled","date_cancelled"
"","","","A3001601","","A3001601","301","01"," A-P INV","PRIVATE INVOICE
FORM","20030221","A01","Please make your payment payable to ALLIED MEDICAL
TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
YOU.","","10A1","Transfer","20030221","Y","","","" ,""
"","","","A3001601","","A3001601","301","","A-P INV","PRIVATE INVOICE
FORM","20030228","A01","Please make your payment payable to ALLIED MEDICAL
TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
YOU.","","10A1","Transfer","20030327","Y","","","" ,""
"","","","A3001601","","A3001601","301","02"," A-P INV","PRIVATE INVOICE
FORM","20030323","A04","Please be advised we do ***NOT*** BILL ANY INSURANCE
COMPANIES. This amount is now due and payable to: Allied Medical
Transport.","","10A1","Transfer","20030328","Y","" ,"","",""
"","","","A3001601","","A3001601","301","03"," A-P INV","PRIVATE INVOICE
FORM","20030331","A12","CPMC HOSPITAL BILLING FOR END THE MONTH
WHEELCHAIR/VAN SERVICE BY ALLIED MEDICAL
TRANSPORT.","","10A1","Transfer","20030417","Y","" ,"","",""
"","","","A3001601","","A3001601","301","02"," A-P INV","PRIVATE INVOICE
FORM","20030429","A12","CPMC HOSPITAL BILLING FOR END THE MONTH
WHEELCHAIR/VAN SERVICE BY ALLIED MEDICAL
TRANSPORT.","","10A1","Transfer","20030502","Y","" ,"","",""
"","","","A3001601","","A3001601","301","01"," A-P INV","PRIVATE INVOICE
FORM","20030515","A04","Please be advised we do ***NOT*** BILL ANY INSURANCE
COMPANIES. This amount is now due and payable to: Allied Medical
Transport.","","10A1","Transfer","20030516","Y","" ,"","",""
"","","","A3001601","","A3001601","301","02"," A-P INV","PRIVATE INVOICE
FORM","20030620","A04","Please be advised we do ***NOT*** BILL ANY INSURANCE
COMPANIES. This amount is now due and payable to: Allied Medical
Transport.","","10A1","Transfer","20030710","Y","" ,"","",""
"","","","A3001601","","A3001601","301","","D-HARDC","TICKET HARDCOPY
8x11","20030811","028","FILE TO CLOSED ACCTS
","","10A1","Transfer","20030829","Y","","","" ,""
"","","","A3001719","","A3001719","301","01"," A-P INV","PRIVATE INVOICE
FORM","20030221","A01","Please make your payment payable to ALLIED MEDICAL
TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
YOU.","","10A1","Transfer","20030221","Y","","","" ,""
"","","","A3001719","","A3001719","301","","A-P INV","PRIVATE INVOICE
FORM","20030228","A01","Please make your payment payable to ALLIED MEDICAL
TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
YOU.","","10A2","Transfer","20030327","Y","","","" ,""
"","","","A3001719","","A3001719","301","02"," A-P INV","PRIVATE INVOICE
FORM","20030321","A04","Please be advised we do ***NOT*** BILL ANY INSURANCE
COMPANIES. This amount is now due and payable to: Allied Medical
Transport.","","10A2","Transfer","20030328","Y","" ,"","",""
"","","","A3001719","","A3001719","301","","D-HARDC","TICKET HARDCOPY
8x11","20030407","028","FILE TO CLOSED ACCTS
","","10A2","Transfer","20030829","Y","","","" ,""
Thanks, Alpha
I don't think this is a SQL Server issue... Sounds like an Excel issue.
But since you're here: by definition, CSV files can have double-quotes
around comma-delimited elements. The reason being that you can have items
like this:
Please make your payment payable to ALLIED MEDICAL
TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
YOU
And without double-quotes delimiting it, the application reading the file
will see 3 different comma-delimited elements when in fact it's only one
element. It sounds like you might have an extra " in there somewhere.
Perhaps one of your items is something like this:
8"x11"
In which case the internal quotes are going to cause problems when the app
tries to read them:
"8"x11""
If your data already contains quotes in it somewhere, and you can arrange
it, TAB-delimited format might work better for you...
"Alpha" <Alpha@.discussions.microsoft.com> wrote in message
news:5ACB9501-43FC-4315-95C6-0548D370A844@.microsoft.com...
> Hi,
> I have a large text file that Excel won't load up completely. The file is
> CSV but has extra " embedded in a string that causes error when I import
> it.
> Is there a way to eliminate these embedded " with some tool? Below is a
> sample of my importing text:
> "ExtBillID","extTID","special_request","trip_ticke t","extiid","invoice","extschedid","cycle","formid ","form","formdate","standard_message","message"," extsid","source","who_printed","date_printed","Pri nted","cancelled","why_cancelled","date_cancelled"
> "","","","A3001601","","A3001601","301","01"," A-P INV","PRIVATE INVOICE
> FORM","20030221","A01","Please make your payment payable to ALLIED MEDICAL
> TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
> YOU.","","10A1","Transfer","20030221","Y","","","" ,""
> "","","","A3001601","","A3001601","301","","A-P INV","PRIVATE INVOICE
> FORM","20030228","A01","Please make your payment payable to ALLIED MEDICAL
> TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
> YOU.","","10A1","Transfer","20030327","Y","","","" ,""
> "","","","A3001601","","A3001601","301","02"," A-P INV","PRIVATE INVOICE
> FORM","20030323","A04","Please be advised we do ***NOT*** BILL ANY
> INSURANCE
> COMPANIES. This amount is now due and payable to: Allied Medical
> Transport.","","10A1","Transfer","20030328","Y","" ,"","",""
> "","","","A3001601","","A3001601","301","03"," A-P INV","PRIVATE INVOICE
> FORM","20030331","A12","CPMC HOSPITAL BILLING FOR END THE MONTH
> WHEELCHAIR/VAN SERVICE BY ALLIED MEDICAL
> TRANSPORT.","","10A1","Transfer","20030417","Y","" ,"","",""
> "","","","A3001601","","A3001601","301","02"," A-P INV","PRIVATE INVOICE
> FORM","20030429","A12","CPMC HOSPITAL BILLING FOR END THE MONTH
> WHEELCHAIR/VAN SERVICE BY ALLIED MEDICAL
> TRANSPORT.","","10A1","Transfer","20030502","Y","" ,"","",""
> "","","","A3001601","","A3001601","301","01"," A-P INV","PRIVATE INVOICE
> FORM","20030515","A04","Please be advised we do ***NOT*** BILL ANY
> INSURANCE
> COMPANIES. This amount is now due and payable to: Allied Medical
> Transport.","","10A1","Transfer","20030516","Y","" ,"","",""
> "","","","A3001601","","A3001601","301","02"," A-P INV","PRIVATE INVOICE
> FORM","20030620","A04","Please be advised we do ***NOT*** BILL ANY
> INSURANCE
> COMPANIES. This amount is now due and payable to: Allied Medical
> Transport.","","10A1","Transfer","20030710","Y","" ,"","",""
> "","","","A3001601","","A3001601","301","","D-HARDC","TICKET HARDCOPY
> 8x11","20030811","028","FILE TO CLOSED ACCTS
> ","","10A1","Transfer","20030829","Y","","","" ,""
> "","","","A3001719","","A3001719","301","01"," A-P INV","PRIVATE INVOICE
> FORM","20030221","A01","Please make your payment payable to ALLIED MEDICAL
> TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
> YOU.","","10A1","Transfer","20030221","Y","","","" ,""
> "","","","A3001719","","A3001719","301","","A-P INV","PRIVATE INVOICE
> FORM","20030228","A01","Please make your payment payable to ALLIED MEDICAL
> TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
> YOU.","","10A2","Transfer","20030327","Y","","","" ,""
> "","","","A3001719","","A3001719","301","02"," A-P INV","PRIVATE INVOICE
> FORM","20030321","A04","Please be advised we do ***NOT*** BILL ANY
> INSURANCE
> COMPANIES. This amount is now due and payable to: Allied Medical
> Transport.","","10A2","Transfer","20030328","Y","" ,"","",""
> "","","","A3001719","","A3001719","301","","D-HARDC","TICKET HARDCOPY
> 8x11","20030407","028","FILE TO CLOSED ACCTS
> ","","10A2","Transfer","20030829","Y","","","" ,""
>
> Thanks, Alpha
|||The problem is the extra " embedded in a string. For example, "This is the
carrier"s responsibility."
Excel is just to help me find the extra " but it's the text file format that
I'm importing. Are you suggesting for me to change
it to Tab delimited? I don't have a way of doing this. The files are given
to me and I can't load it in Excel completely to change
the delimiter to tab.
Thanks,
Alpha
"Michael C#" wrote:
[vbcol=seagreen]
> I don't think this is a SQL Server issue... Sounds like an Excel issue.
> But since you're here: by definition, CSV files can have double-quotes
> around comma-delimited elements. The reason being that you can have items
> like this:
> Please make your payment payable to ALLIED MEDICAL
> TRANSPORT @. 2570 BUSH STREET, SAN FRANCISCO, CA 94115 THANK
> YOU
> And without double-quotes delimiting it, the application reading the file
> will see 3 different comma-delimited elements when in fact it's only one
> element. It sounds like you might have an extra " in there somewhere.
> Perhaps one of your items is something like this:
> 8"x11"
> In which case the internal quotes are going to cause problems when the app
> tries to read them:
> "8"x11""
> If your data already contains quotes in it somewhere, and you can arrange
> it, TAB-delimited format might work better for you...
> "Alpha" <Alpha@.discussions.microsoft.com> wrote in message
> news:5ACB9501-43FC-4315-95C6-0548D370A844@.microsoft.com...

>
>
|||That was my recommendation - as I said, "if you can arrange it" - convert it
to Tab-delimited. But, that aside, they didn't give you a true CSV file if
they embedded extra quotes inside the quoted strings and there's not a whole
heckuva lot you can do about that, other than 1) write a program to read the
file, recognize which fields have extra quotes, and modify them or 2) figure
out which lines have bad data and manually fix them.
Of course if you receive more of these files on a regular basis you'll more
than likely have the same problem. I would contact the sender - *if
possible* - and ask them to send clean data, or ask if they could send it in
Tab-delimited, pipe-delimited or some other format. That or get cracking on
a program that can read their data the way it was meant, as opposed to how
it was sent.
"Alpha" <Alpha@.discussions.microsoft.com> wrote in message
news:393A48C7-10FC-4DC2-85C0-9468FABE8A6F@.microsoft.com...[vbcol=seagreen]
> The problem is the extra " embedded in a string. For example, "This is
> the
> carrier"s responsibility."
> Excel is just to help me find the extra " but it's the text file format
> that
> I'm importing. Are you suggesting for me to change
> it to Tab delimited? I don't have a way of doing this. The files are
> given
> to me and I can't load it in Excel completely to change
> the delimiter to tab.
> Thanks,
> Alpha
> "Michael C#" wrote:
[vbcol=seagreen]
|||Thank you. I think I will break the file into several files and then import
it to Excel to clean it up and then import them one by one. I think
that will work.
Thanks,
Alpha
"Michael C#" wrote:
[vbcol=seagreen]
> That was my recommendation - as I said, "if you can arrange it" - convert it
> to Tab-delimited. But, that aside, they didn't give you a true CSV file if
> they embedded extra quotes inside the quoted strings and there's not a whole
> heckuva lot you can do about that, other than 1) write a program to read the
> file, recognize which fields have extra quotes, and modify them or 2) figure
> out which lines have bad data and manually fix them.
> Of course if you receive more of these files on a regular basis you'll more
> than likely have the same problem. I would contact the sender - *if
> possible* - and ask them to send clean data, or ask if they could send it in
> Tab-delimited, pipe-delimited or some other format. That or get cracking on
> a program that can read their data the way it was meant, as opposed to how
> it was sent.
> "Alpha" <Alpha@.discussions.microsoft.com> wrote in message
> news:393A48C7-10FC-4DC2-85C0-9468FABE8A6F@.microsoft.com...
d"
>
>
|||If you find that you have to handle these files on a routine basis, you
might see what my csv parser can do for you. It is not free, but it is
made to attempt to handle what would normally be considered invalid
unescaped double quotes inside text qualified/double quote delimited
data specifically because I've heard odd rumors, and I think at one
point was even able to recreate this problem using FoxPro.
http://www.geocities.com/shriop/index.html You could use it to loop
over the data and either export it back out in proper csv that sql
server would handle, or change the delimiter all together.
|||You could write a program that attempts to figure it out yourself.
Basically you would:
1) Read in a line
2) Scan it character by character
2a) If the following conditions are true, then you are at the end of a
field: a) the current character is a quote, the next character is a comma,
and the next character after that is a quote or an end-of-line character
2b) If the following conditions are true, then you are at the beginning
of a field: a) the current character is a quote, the last character is a
comma and the character before that is a quote; or this quote character is
the first character of the line
3) Once you've determined the beginning and end of a field, write the field
out to a new file sans the extraneous quotes
Using this simple logic *should* cover 99% of your screwy CSV file. You
could then choose to write the file out to a Tab-Delimited, or other
delimited format if you want to keep the internal quotes; or you could write
it back out to a correct CSV format by eliminating the internal quotes.
Like I said, if I were you, I would definitely see if I could get clean
data - if you expect to receive any future data from these folks that is.
Will save you some headaches in the long run.
"Alpha" <Alpha@.discussions.microsoft.com> wrote in message
news:E4C5A196-AC27-4459-9359-2CF7E2667FCD@.microsoft.com...[vbcol=seagreen]
> Thank you. I think I will break the file into several files and then
> import
> it to Excel to clean it up and then import them one by one. I think
> that will work.
> Thanks,
> Alpha
> "Michael C#" wrote:
ed"[vbcol=seagreen]
|||Yes, I will give that a try. Thank you very much.
"Michael C#" wrote:
[vbcol=seagreen]
> You could write a program that attempts to figure it out yourself.
> Basically you would:
> 1) Read in a line
> 2) Scan it character by character
> 2a) If the following conditions are true, then you are at the end of a
> field: a) the current character is a quote, the next character is a comma,
> and the next character after that is a quote or an end-of-line character
> 2b) If the following conditions are true, then you are at the beginning
> of a field: a) the current character is a quote, the last character is a
> comma and the character before that is a quote; or this quote character is
> the first character of the line
> 3) Once you've determined the beginning and end of a field, write the field
> out to a new file sans the extraneous quotes
> Using this simple logic *should* cover 99% of your screwy CSV file. You
> could then choose to write the file out to a Tab-Delimited, or other
> delimited format if you want to keep the internal quotes; or you could write
> it back out to a correct CSV format by eliminating the internal quotes.
> Like I said, if I were you, I would definitely see if I could get clean
> data - if you expect to receive any future data from these folks that is.
> Will save you some headaches in the long run.
> "Alpha" <Alpha@.discussions.microsoft.com> wrote in message
> news:E4C5A196-AC27-4459-9359-2CF7E2667FCD@.microsoft.com...
lled"
>
>
|||A different question, comma (,) is the delimiter currently used and double
quote is the text qualifier. If I have a string
"Please send the check, thanks", would the comma enclosed in the double
quote caused a problem when I'm importing to a table?
Thanks.
"Michael C#" wrote:
[vbcol=seagreen]
> You could write a program that attempts to figure it out yourself.
> Basically you would:
> 1) Read in a line
> 2) Scan it character by character
> 2a) If the following conditions are true, then you are at the end of a
> field: a) the current character is a quote, the next character is a comma,
> and the next character after that is a quote or an end-of-line character
> 2b) If the following conditions are true, then you are at the beginning
> of a field: a) the current character is a quote, the last character is a
> comma and the character before that is a quote; or this quote character is
> the first character of the line
> 3) Once you've determined the beginning and end of a field, write the field
> out to a new file sans the extraneous quotes
> Using this simple logic *should* cover 99% of your screwy CSV file. You
> could then choose to write the file out to a Tab-Delimited, or other
> delimited format if you want to keep the internal quotes; or you could write
> it back out to a correct CSV format by eliminating the internal quotes.
> Like I said, if I were you, I would definitely see if I could get clean
> data - if you expect to receive any future data from these folks that is.
> Will save you some headaches in the long run.
> "Alpha" <Alpha@.discussions.microsoft.com> wrote in message
> news:E4C5A196-AC27-4459-9359-2CF7E2667FCD@.microsoft.com...
lled"
>
>
|||No, the comma won't cause a problem if you're using the sql server
import data wizard or a dts package. It will cause problems if you're
using bcp or bulk insert.

No comments:

Post a Comment