Tips for Excel, Word, PowerPoint and Other Applications

Cleaning Data With Notepad

Why It Matters To You

I recently finished a Vendor Management System (VMS) implementation. As a part of the work, I had to create upload files, conforming to template .CSV layouts, to load large data tables into the system. One of these was a list of the cost centers each user was allowed to view and execute transactions for. On the surface, the upload file is pretty simple, except for the way cost centers were required to be chained together. Here's an example. In column C, Cost Centers, multiple cost centers are separated by tildes (~).

You could write a large concatenation formula to chain all of these together, in fact, I started with this, but it as a pain in the ass. Also, as some users may have 50+ cost centers, this becomes pretty unwieldy.

Meanwhile, I had a nice table of cost centers by business unit, so how do I translate one format into the upload format?

How To ...

Before we get started, there are 3 example files:

data_cleaning2.xls - Sample upload template and cost center list.
data_cleaning2.doc - Excel data copied into Word.
data_cleaning2.txt - Cleaned Word data copied into Notepad.

I started off with modifying my table of cost centers a bit. I added a helper column named, Consolidation Code. This was going to give me all of the tildes I needed to separate cost centers and identify for me the first entry for each business unit. The formula I used was, =IF(E4=E3,"~","XXX"&E4&","). All this does is look at a line and compare the business unit to that of the line above. If they are the same, then just enter a tilde. If they aren't the same, then this is the first entry of a new business unit, so enter the name of the business unit, but put an "XXX" in front of it first. The reasoning for this will be clear later. Copy this formula all the way to the bottom of your cost center table.

Now copy and paste Columns A:B as text into Word as Unformatted Text

Turn on the Show/Hide Formatting Symbols. It's on the Standard Toolbar near the end. If you click the button indicated, your Word document will show all of the tabs and paragraph marks (e.g., a hard line break).

This should be what you see. When you copy and paste from Excel, you will typically see paragraph marks at last entry of a row and tab marks separating cell data.

Now we're going to remove all of the tab and paragraph marks. This can't be done in Notepad, which is why we are in Word. Bring up the Search and Replace wizard with CTRL+H. Click on More: Special and you'll see a list of special characters you can search for ... and replace. There are a couple of key codes you probably want to remember: paragraph marks (^p) and tabs (^t).

Search and replace (CTRL+H) all paragraph marks (^p) and tabs (^t) with nothing and then replace all of the XXX with a paragraph mark (^p). The reason we added the XXX might be apparent now. Had we not marked it with anything, we would not have been any systematic way to separate business units from line-wrapped cost centers. With the XXX, we have a ready marker to Searc & Replace on.

After all of your search and replaces are done, your Word document should look like this. However, it still has wrapped line, which will show up as separate rows if we bring it into Excel right now.

So, we take the next step and Copy/Paste the data from Word to Notepad, which eliminates the line wraps. You see it wrapped in the window, but it really isn't. This will copy into Excel as one nice line of data, with business unit separated from the cost centers by a , and all the cost centers separated by tildes (~)

That's it.

Notes

Last updated7/25/08
Application VersionExcel 2003
AuthorMichael Kan
Pre-requisitesNone
Related Tips Cleaning Data With Notepad