UTF-8, ASCII, and that horrible thing called FTPS

So I need to rant a bit here, but I'm also hoping some of this will help people out, because I realized there is not much information out there on this problem.


Once upon a time, people were inventing ways to connect to other servers and move files around. FTP was an early protocol for exactly that. It worked great, but it was unencrypted. So SFTP was born: strictly speaking not FTP with encryption bolted on, but a separate file transfer protocol that runs over SSH. It's a great example of the power of open source. The SFTP library is full-featured and very stable. So why would anyone want to use anything else?

Honestly, I don't know the answer to that. But Microsoft decided not to get on board. To be fair, they didn't invent FTPS (it is plain old FTP wrapped in TLS, standardized in RFC 4217), but it is what IIS offers out of the box instead of SFTP. Now if you run Microsoft servers and ASP.NET or what have you, it's a non-issue. But if you want to talk to those servers from anything open source, you are going to have to bridge that gap somehow. And what I found out, at least in Ruby, was that there are almost no options for you.

You've got:

the double bag gem,

this four-year-old Ruby hack,

and the ftpfxp gem, which I won't even link to because it didn't work at all for me.

So really, slim pickings. First lesson: if anyone asks about FTPS, tell them to run like hell. I was able to get something going with the double bag gem, but it is ugly. It occasionally throws errors for no reason I can see. I've tried both implicit and explicit mode, and it is just a buggy protocol, I guess. So there it is. SFTP just works. FTPS is a mess. How's that for a triumph of open source over big Billysoft?
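
For what it's worth, the Net::FTP class that ships with Ruby has accepted an ssl option for explicit FTPS since Ruby 2.4, as far as I know. Here is a minimal sketch, assuming a reasonably recent Ruby. The helper name open_explicit_ftps and its parameters are hypothetical, not something taken from the gems above, and nothing guarantees it behaves any better against these servers:

```ruby
require 'net/ftp'

# Hypothetical helper: opens an explicit FTPS session (plain FTP upgraded to
# TLS before login) using only the standard Net::FTP class.
def open_explicit_ftps(host, user, password)
  Net::FTP.new(host,
               ssl: true,          # negotiate TLS before sending credentials
               username: user,
               password: password,
               passive: true)      # passive mode is usually friendlier to firewalls
end
```

Calling open_explicit_ftps with a real host, user, and password would connect and log in. With ssl: true the server certificate is verified, so a self-signed certificate will probably be rejected. Implicit mode is a different matter and is not covered by this sketch.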

PART II: ASCII characters

Once upon a time, we were figuring out how best to represent characters digitally, and in the beginning we had ASCII. Then Unicode was invented because standardization is our friend, and pretty much everyone moved to UTF-8 except for, you guessed it, Microsoft. So here we are again. Every so often, the data I get from these horrible FTPS servers contains bytes that are not valid UTF-8, which Ruby can only treat as ASCII-8BIT (its name for raw binary). Just to piss me off, I think. They bring the whole ingest to a halt. So here's how I handled that.
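
To make the problem concrete, here is a small sketch of how Ruby sees such bytes. The helper inspect_bytes and the sample string are made up for illustration. The sample assumes a single stray byte 0xE9 of the kind a legacy Windows code page might produce, which is a guess about the data, not something taken from the actual feed:

```ruby
# Hypothetical helper: reports how Ruby sees a chunk of downloaded bytes.
def inspect_bytes(raw)
  # Relabeling as UTF-8 does not change the bytes, it only changes the tag.
  as_utf8 = raw.dup.force_encoding(Encoding::UTF_8)

  # Transcoding from binary fails for any byte above 0x7F, because
  # ASCII-8BIT has no defined mapping to UTF-8 for those bytes.
  convertible = begin
    raw.encode(Encoding::UTF_8)
    true
  rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
    false
  end

  { encoding: raw.encoding, valid_utf8: as_utf8.valid_encoding?, convertible: convertible }
end

# "caf" followed by the lone byte 0xE9; .b tags the string as ASCII-8BIT.
SAMPLE_BYTES = "caf\xE9".b
p inspect_bytes(SAMPLE_BYTES)
```

For the sample, both valid_utf8 and convertible come back false, which is the kind of failure that stops the ingest.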

Step 1: get the data. Sounds easy, but if there are any non-UTF-8 characters in there, the download will fail on a call to 'gettextfile'. So just download it as a binary file with 'getbinaryfile' instead. You're still going to have problems parsing it, but for now just get the damn thing to download.
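
A minimal sketch of that step, assuming ftp is an already-connected session object that responds to getbinaryfile the way Net::FTP does. The helper name download_raw and the paths are hypothetical:

```ruby
# Hypothetical helper: fetches a remote file byte for byte.
# `ftp` is assumed to be an open, logged-in Net::FTP-style session.
def download_raw(ftp, remote_path, local_path)
  # getbinaryfile copies the bytes untouched, so odd characters
  # cannot break the transfer itself.
  ftp.getbinaryfile(remote_path, local_path)
  local_path
end
```

It returns the local path so the result can be handed straight to whatever cleans the file up next.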

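Step 2: remove the offending characters. Bytes that aren't valid UTF-8 can't, by definition, go into a strict UTF-8 string, and without knowing the original encoding there is no way to tell which characters they were meant to be, so just dump them. I used a function like this: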

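# Despite the name, this strips the non-ASCII bytes and keeps the ASCII ones.
def remove_ascii(filename)
  outfile = ""
  # Read the file as raw bytes, which Ruby tags as ASCII-8BIT.
  File.open(filename, 'r:ascii-8bit') do |o|
    o.each_char do |x|
      # Bytes above 0x7F have no UTF-8 mapping from ASCII-8BIT; replace them with nothing.
      outfile << x.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '')
    end
  end
  # Opening with "w" truncates the file, so no separate delete is needed.
  File.open(filename, "w") { |output| output << outfile }
end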

And we are done. I don't feel particularly good about it. It's not an elegant solution, but I shouldn't be getting those characters in the first place.
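
One caveat with reading everything as ASCII-8BIT is that legitimate multi-byte UTF-8 characters get thrown out along with the junk. A possibly gentler variant, sketched here under the assumption that Ruby 2.1 or later is available (for String#scrub), keeps whatever is valid UTF-8 and drops only the broken bytes. The name scrub_to_utf8 is made up for this example:

```ruby
# Hypothetical alternative: keep valid UTF-8, drop only the bytes that are not.
def scrub_to_utf8(filename)
  raw = File.binread(filename)
  # Relabel the bytes as UTF-8, then replace each invalid sequence with nothing.
  cleaned = raw.force_encoding(Encoding::UTF_8).scrub('')
  # Overwrite the original file with the cleaned text.
  File.write(filename, cleaned)
  cleaned
end
```

If the files turn out to be in a known legacy encoding such as Windows-1252 (an assumption; nothing above confirms it), relabeling the bytes with that encoding and transcoding to UTF-8 would convert the characters instead of dropping them.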

Anyway, good luck out there!