|
I've been asking around but I haven't got any answers. I know this community is very technically proficient and intelligent, so I come to you. I understand that this is random but please help!
I have been given the task to clean up a contact list of a survey of over 6,000 people! Some of them are duplicate names or emails so they must be removed. HOWEVER, the problem is that I cannot simply delete a duplicate name or email because the email or name might be a duplicate but the cell phone numbers are different! Thus, I am required to combine both cell phone numbers under the cell phone column FIRST before I delete the duplicate.
Here is an example to clearly convey the issue:
![[image loading]](http://i.imgur.com/vFmCp.png)
In this example, we have two duplicate names (yes, people can have the same names but in this case imagine that they are the same person listed twice), Mark and DJ. I need to delete one, but before I do I need to make sure I am not deleting vital information in the process. In this case they have two different cell phone numbers but they are not collected under one name. I need to keep the two different cell phone numbers and then delete the duplicate names.
The thing is I can do it manually but it would take me forever to go over the hundreds of duplicates in the Excel. I am looking for the most efficient way to tackle this problem. I hope it is okay that I ask this in my blog. I know we have a General Tech but I am pretty sure I can't ask Excel questions as that would be blasphemous in a gaming forum.
Thanks for listening!
   
|
|
|
I think you missed the 6000 part.
OP, I don't suppose you know any VBA?
|
Calgary25969 Posts
I use Excel a llllllllllloooooooooooooooottttttttttttttt and you should just sort and do it manually. No other method is going to be 100% accurate if you can't be sure of how your data is formatted.
|
I tried this but I was not getting anywhere because there were so many different categories to change and combine (the cell phone was just simple example) and there are over 700 duplicates.
|
Calgary25969 Posts
On December 16 2011 04:10 Glacierz wrote:I think you missed the 6000 part. OP, I don't suppose you know any VBA? I know VBA. What do you need help with specifically?
|
On December 16 2011 04:11 Chill wrote: I use Excel a llllllllllloooooooooooooooottttttttttttttt and you should just sort and do it manually. No other method is going to be 100% accurate if you can't be sure of how your data is formatted.
Thanks Chill, I guess I just better get started and chip away.
Yeah I know how to remove duplicates, that is the easy part
|
When you say duplicate, are the content in both cells exactly the same? as in if I did something like "=IF(A1=A2, 1, 0) it would return 1?
|
Calgary25969 Posts
Oh okay I just saw what you're looking for (imgur is blocked at work, saw on my phone).
Basically you want to have a unique list of names and then add the other fields together? PM me the file I'll write you a script.
The problem you're going to have is formatting consistency. If someone has John Smith and another one is john smith and lastly smith, john they're all going to be different.
If your data is formatted uniformly then I can do it. Otherwise, manually is the best.
|
United States1967 Posts
I can't see the image, since imgur is blocked at work, but I'm sure there's a way to automate it. I'll take a look when I get home.
|
On December 16 2011 04:15 Glacierz wrote: When you say duplicate, are the content in both cells exactly the same? as in if I did something like "=IF(A1=A2, 1, 0) it would return 1?
Yes, some of the content in two different cells might be exactly-exactly the same but it does not mean that the other information in corresponding row are the same as well. They can be different and so I must combine those differences into one row and match it with just one name.
|
Isn't there a way to use the A-Z row sort function such that the columns stay locked? (or vice versa) If you did that, then you could just merge all the duplicate names.
That's all that's needed, and with all the crazy functions Excel has I would be shocked if that wasn't possible.
|
With VBA, you can write a nested loop after the names are sorted in alphabetical order. You keep an expanding collection of phone numbers that are unique and check with the list with every iterative step.
If the data set is well behaved, you can even do this in MATLAB, which contains the function union(), that accomplishes what you are trying to do in 1 line of code.
|
Uh, a quick approach off the top of my head:
1) sort A->Z based on names (keep # column in the same selection so it's sorted along... you know this part right) 2) run a simple macro to check name cells and remove if same as previous (gets rid of duplicate names) 3) run a macro to check number cells and if the pair name cell is empty (duplicate), then append current cell with the previous cell. you may need to do some extra work to make sure excel treats the number cell as "general" or "text" not "number", or you may run into errors. 4) run a macro to remove rows if name cell is empty -> sorted list with no duplicate names and no loss of numbers
|
On December 16 2011 04:16 Chill wrote: Oh okay I just saw what you're looking for (imgur is blocked at work, saw on my phone).
Basically you want to have a unique list of names and then add the other fields together? PM me the file I'll write you a script.
The problem you're going to have is formatting consistency. If someone has John Smith and another one is john smith and lastly smith, john they're all going to be different.
If your data is formatted uniformly then I can do it. Otherwise, manually is the best.
If it is just capitalization that can be taken care of easily. There's a VBA function that converts all char to lower or upper case.
|
On December 16 2011 04:16 Chill wrote:
Basically you want to have a unique list of names and then add the other fields together? PM me the file I'll write you a script. .
Yes, exactly what I was looking for!!! But I have no idea how to do it.
Wow, that is very kind of you, but I doubt my prof would let me since this a list of confidential contact info! Hehe people would be pissed at me. Is there a resource or tutorial online that I could study to tackle this problem? Thanks again.
|
Calgary25969 Posts
On December 16 2011 04:20 OpticalShot wrote: Uh, a quick approach off the top of my head:
1) sort A->Z based on names (keep # column in the same selection so it's sorted along... you know this part right) 2) run a simple macro to check name cells and remove if same as previous (gets rid of duplicate names) 3) run a macro to check number cells and if the pair name cell is empty (duplicate), then append current cell with the previous cell. you may need to do some extra work to make sure excel treats the number cell as "general" or "text" not "number", or you may run into errors. 4) run a macro to remove rows if name cell is empty -> sorted list with no duplicate names and no loss of numbers Yea I have a similar solution.
Run macro that: 1) Sorts A->Z based on names 2) Loops through and checks if n name is the same as n+1 name. 2a) If it is, checks if all n+1 data is unique. If it is, appends it to the n data. If it isn't, skip. 2b) Deletes row n+1 2c) Rechecks row n vs new row n+1 2d) n++ until blank
|
On December 16 2011 04:20 OpticalShot wrote: Uh, a quick approach off the top of my head:
1) sort A->Z based on names (keep # column in the same selection so it's sorted along... you know this part right) 2) run a simple macro to check name cells and remove if same as previous (gets rid of duplicate names) 3) run a macro to check number cells and if the pair name cell is empty (duplicate), then append current cell with the previous cell. you may need to do some extra work to make sure excel treats the number cell as "general" or "text" not "number", or you may run into errors. 4) run a macro to remove rows if name cell is empty -> sorted list with no duplicate names and no loss of numbers
Yes, I know how to sort but I am lost when it comes to running macros. Are macros similar to the functions within excel?
edit: Learning about how to make a macro right now. I'll do my best.
|
for(int i = 0;i<cells.length();i++){ for(int j=1 + i;j<cells.length();j++){ if(name[i].trim() == name[j].trim() && phone[i].trim() == phone[j].trim()){ //delete the row corresponding to name[j] } } }
I don't know how to make macros in excel but I hope that helps 
EDIT:Thanks Glacerz <3
|
Macros are basically VBA code that are recorded through user interactions with Excel. If you find yourself repeating the same action multiple times, you can write a macro and loop through it to simplify the work.
|
On December 16 2011 04:36 mrafaeldie12 wrote:for(int i = 0;i<cells.length();i++){ for(int j=0;j<cells.length();j++){ if(name[i].trim() == name[j].trim() && phone[i].trim() == phone[j].trim()){ //delete the row corresponding to name[j] } } } I don't know how to make macros in excel but I hope that helps 
Your inner loop should be for(int j = i+1;...) not j=0, using it as it is is gonna remove every entry.
|
You can learn macros from just simply: 1) start recording 2) do the action you want to do 3) stop recording 4) check macro
Alternatively, you can google for similar macros and you'll find a few good ones.
|
set up a mysql database with phpmyadmin, export table as csv, import table in phpmyadmin, do your modifications through SQL queries, export as csv, import to excel sheet. If you are familiar with mysql/phpmyadmin, that's probably the easiest way if it's just a one time deal. If either of these conditions is false though, ignore this post
|
infinity21
Canada6683 Posts
1) Sort 2) Remove duplicates 3) Add a column with the function =COUNTIF(A:A,A2) where 2 is the row number 4) Filter for the new column for values >=2 If there's entries for 3+ phone numbers, you're going to have to append the numbers to a different column If you're lucky and there's only a few rows with >=2 phone #s, then you can do it manually without having to code in VBA.
On December 16 2011 04:45 MisterD wrote:set up a mysql database with phpmyadmin, export table as csv, import table in phpmyadmin, do your modifications through SQL queries, export as csv, import to excel sheet. If you are familiar with mysql/phpmyadmin, that's probably the easiest way if it's just a one time deal. If either of these conditions is false though, ignore this post  I don't think someone who doesn't know VBA would know SQL lol
|
Does it have to be in excel? I often combine excel with some really light python scripting and it's ridiculously to do things like this. Put the data in a text file:
file = open(filename) people = {}
for line in file: person, number = line.split() if person not in people.keys(): people[person] = {} people[person][number] = 0 # I'm just being lazy with sets
for person in people.keys(): res = "person" for number in person[people].keys(): res += "\t%s" % number print res
Just pipe that out into a csv and you can copy right back into excel, ezpz.
edit: damnit, it's not preserving my spacing, but if you quote my post it shows the spacing.
|
On December 16 2011 04:49 infinity21 wrote:1) Sort 2) Remove duplicates 3) Add a column with the function =COUNTIF(A:A,A2) where 2 is the row number 4) Filter for the new column for values >=2 If there's entries for 3+ phone numbers, you're going to have to append the numbers to a different column If you're lucky and there's only a few rows with >=2 phone #s, then you can do it manually without having to code in VBA. Show nested quote +On December 16 2011 04:45 MisterD wrote:set up a mysql database with phpmyadmin, export table as csv, import table in phpmyadmin, do your modifications through SQL queries, export as csv, import to excel sheet. If you are familiar with mysql/phpmyadmin, that's probably the easiest way if it's just a one time deal. If either of these conditions is false though, ignore this post  I don't think someone who doesn't know VBA would know SQL lol
Me until this work term bahaha - Richie
|
|
Thanks guys, I feel like I have a direction now. If I uploaded the same file but with fake information (I cannot upload actual contact information), could you guys write a macro for it so I can see what you guys did for this specific problem? I can write macros for other simple functions but the ones listed here are over my head.
edit: it will just be a small part of the info, to keep things simple.
|
Calgary25969 Posts
Your best bet is to use "Record Macro". Do something basic that you need to know the code for. Stop the macro. Press alt 11. Open the macro you just recorded. Steal the code. Put it into your code.
Repeat.
|
infinity21
Canada6683 Posts
On December 16 2011 04:59 zeks wrote:Show nested quote +On December 16 2011 04:49 infinity21 wrote:1) Sort 2) Remove duplicates 3) Add a column with the function =COUNTIF(A:A,A2) where 2 is the row number 4) Filter for the new column for values >=2 If there's entries for 3+ phone numbers, you're going to have to append the numbers to a different column If you're lucky and there's only a few rows with >=2 phone #s, then you can do it manually without having to code in VBA. On December 16 2011 04:45 MisterD wrote:set up a mysql database with phpmyadmin, export table as csv, import table in phpmyadmin, do your modifications through SQL queries, export as csv, import to excel sheet. If you are familiar with mysql/phpmyadmin, that's probably the easiest way if it's just a one time deal. If either of these conditions is false though, ignore this post  I don't think someone who doesn't know VBA would know SQL lol Me until this work term bahaha - Richie haha <3
On December 16 2011 05:03 Golgotha wrote: Thanks guys, I feel like I have a direction now. If I uploaded the same file but with fake information (I cannot upload actual contact information), could you guys write a macro for it so I can see what you guys did for this specific problem? I can write macros for other simple functions but the ones listed here are over my head.
edit: it will just be a small part of the info, to keep things simple. Try what I wrote and if you get a reasonable number of rows (~100?) with 2+ phone numbers, I suggest just doing it manually. If it turns out that there's 500+, post a fake file here and I'll help you out after I'm done with my exam tonight. I'm strangely helpful during exam times...
|
edit: nevermind, you need it done a different way
|
Excel File: http://www.mediafire.com/?jzzta80m4t973fg
I took the original file and cut it down a bunch and changed all the names. But the template we used to map the fields are all the same!
Some of the fields have nothing in them but that is how it really is; the whole project was a bit messy.
In this example there is only one person who is a duplicate. Before I delete the duplicate, I must combine all the other information listed under his name from both rows into one. For example: We can see that he works for two different companies, University of Polo and The Jukebox. I want to combine them both into one field before I delete the duplicate. So in the end the finished part should look something like this: http://i.imgur.com/uT1Qr.png
I hope this more clearly presents what I am trying to do. Thank you! I am hoping I can learn how to automate this process, otherwise I will just do everything manually. But please I don't fancy going over a 6,000 person list
|
On December 16 2011 04:59 MoreFaSho wrote:file = open(filename) people = {}
for line in file: person, number = line.split() if person not in people.keys(): people[person] = {} people[person][number] = 0 # I'm just being lazy with sets
for person in people.keys(): res = "person" for number in person[people].keys(): res += "\t%s" % number print res
I was going to recommend python too, but you were faster. Also, [ code] tag is available on TL
|
It's SO ironic that I was going to create a thread about Excel yesterday but thought it would get closed so I hope the OP doesn't mind me asking this question here but I just finished a course in VBA programming this semester and was wondering if I could put on my resume that I am proficient in excel or is the "programming" side of excel different from what they need in an office. I see so many times in job descriptions of clerical jobs that they want someone who is proficient in word,excel blah blah but for excel I'm guessing they are stricly talking about working with spreadsheets and not the coding aspect of it?
|
On December 16 2011 06:01 MaRiNe23 wrote: It's SO ironic that I was going to create a thread about Excel yesterday but thought it would get closed so I hope the OP doesn't mind me asking this question here but I just finished a course in VBA programming this semester and was wondering if I could put on my resume that I am proficient in excel or is the "programming" side of excel different from what they need in an office. I see so many times in job descriptions of clerical jobs that they want someone who is proficient in word,excel blah blah but for excel I'm guessing they are stricly talking about working with spreadsheets and not the coding aspect of it?
Proficiency in excel does not mean VBA. If you know how to work a pivot table you should be able to put both Excel and VBA on your resume with no problem.
|
I always wondered what it means to be proficient in Word though...
|
On December 16 2011 06:10 Glacierz wrote:Show nested quote +On December 16 2011 06:01 MaRiNe23 wrote: It's SO ironic that I was going to create a thread about Excel yesterday but thought it would get closed so I hope the OP doesn't mind me asking this question here but I just finished a course in VBA programming this semester and was wondering if I could put on my resume that I am proficient in excel or is the "programming" side of excel different from what they need in an office. I see so many times in job descriptions of clerical jobs that they want someone who is proficient in word,excel blah blah but for excel I'm guessing they are stricly talking about working with spreadsheets and not the coding aspect of it? Proficiency in excel does not mean VBA. If you know how to work a pivot table you should be able to put both Excel and VBA on your resume with no problem. If I learn that through google or youtube do you think I'll be okay to put it on my resume then? Because I know this guy on youtube "excelisfun" who has like 10 billion youtube videos about excel. I think I can find out how to work a pivot table from one of those vids and worth it to learn if it'll allow me to put those two on my resume..
|
On December 16 2011 06:10 Glacierz wrote:Show nested quote +On December 16 2011 06:01 MaRiNe23 wrote: It's SO ironic that I was going to create a thread about Excel yesterday but thought it would get closed so I hope the OP doesn't mind me asking this question here but I just finished a course in VBA programming this semester and was wondering if I could put on my resume that I am proficient in excel or is the "programming" side of excel different from what they need in an office. I see so many times in job descriptions of clerical jobs that they want someone who is proficient in word,excel blah blah but for excel I'm guessing they are stricly talking about working with spreadsheets and not the coding aspect of it? Proficiency in excel does not mean VBA. If you know how to work a pivot table you should be able to put both Excel and VBA on your resume with no problem.
I agree with Glacierz. - put both and put more than proficient. That is such a resume buzz word that someone may just assume you know how to use excel, but not know VBA or more. If you get an interview definitely describe how progaming (gramming) can help their department. I work in accounting and in the first year here took a lot of the tasks they gave me and halved the time previously spent on it. Office environments need excel programmers, they just don't know they do (maybe not full time, but they should be talking to consultants). Good luck
|
On December 16 2011 06:20 MaRiNe23 wrote:Show nested quote +On December 16 2011 06:10 Glacierz wrote:On December 16 2011 06:01 MaRiNe23 wrote: It's SO ironic that I was going to create a thread about Excel yesterday but thought it would get closed so I hope the OP doesn't mind me asking this question here but I just finished a course in VBA programming this semester and was wondering if I could put on my resume that I am proficient in excel or is the "programming" side of excel different from what they need in an office. I see so many times in job descriptions of clerical jobs that they want someone who is proficient in word,excel blah blah but for excel I'm guessing they are stricly talking about working with spreadsheets and not the coding aspect of it? Proficiency in excel does not mean VBA. If you know how to work a pivot table you should be able to put both Excel and VBA on your resume with no problem. If I learn that through google or youtube do you think I'll be okay to put it on my resume then? Because I know this guy on youtube "excelisfun" who has like 10 billion youtube videos about excel. I think I can find out how to work a pivot table from one of those vids and worth it to learn if it'll allow me to put those two on my resume..
That's fine. MS office products do not require actual courses to master (maybe VBA would be the only exception). Just google whatever you need to learn. I put pivot table out there because that's as far as any job would require when Excel is listed as a pre-req. I'd look through a curriculum, and stop after pivot tables.
|
Concatenate the values with '&' and then check if there are any duplicates
|
|
Not testing this code, but you said you have some experience with VBA so I'm assuming you can edit appropriately. Sort first, then run.
Sub combineDuplicates()
For i = 2 to ActiveSheet.UsedRange.SpecialCells(xlCellTypeLastCell).Row If Cells(i,2).Value = Cells(i-1,2).Value And Cells(i,3).Value = Cells(i-1,3).Value And Cells(i,4).Value=Cells(i-1,4).Value Then For j = 4 to ActiveSheet.UsedRange.SpecialCells(xlCellTypeLastCell).Column Cells(i-1,j) = Cells(i-1,j).Value & ";" & Cells(i,j).Value Next j Cells(i,j).EntireRow.Delete i = i-1 Next i
End Sub
This is assuming column 2,3,4 are your name identifiers and anything after are things that should be concatenated.
edit: and yes I know this code is weak super lame for you VBA'ers out there, but deal with it
|
You don't even need VBA. There is a built-in function called "remove duplicates" that does exactly what you need.
Select your 6000 as well as the columns name, cell, address. Click on the Excel tab "Data". Then click on the icon "Remove Duplicates". And voilà.
|
infinity21
Canada6683 Posts
If you have that many columns, I don't think you can take any shortcuts... Only way I can think of right now is 1) sort 2) concatenate the first and last name into a column at the end, say Z 3) for row i, declare a dynamic array of size (countif(Z: Z ,Zi), original # of columns) 4) have a function that looks at each entry starting from row i to row i + countif(Z: Z, Zi) -1 and appends to the dynamic array if it isn't in the array already 5) paste the dynamic array with a function that automatically inserts a separator between each entry and doesn't put in blanks 6) i=i+countif(Z: Z, Zi) and repeat
|
On December 16 2011 04:49 infinity21 wrote:1) Sort 2) Remove duplicates 3) Add a column with the function =COUNTIF(A:A,A2) where 2 is the row number 4) Filter for the new column for values >=2 If there's entries for 3+ phone numbers, you're going to have to append the numbers to a different column If you're lucky and there's only a few rows with >=2 phone #s, then you can do it manually without having to code in VBA. Show nested quote +On December 16 2011 04:45 MisterD wrote:set up a mysql database with phpmyadmin, export table as csv, import table in phpmyadmin, do your modifications through SQL queries, export as csv, import to excel sheet. If you are familiar with mysql/phpmyadmin, that's probably the easiest way if it's just a one time deal. If either of these conditions is false though, ignore this post  I don't think someone who doesn't know VBA would know SQL lol
well -- i do? the "new gen" university software developers around here are all java grown. with the advancement of mac and linux on personal use computers, there's no point in even trying to use a microsoft language during education. SQL however is pretty much the first thing in any database lecture. so there's a whole bunch of people who are this way.
though you are probably correct with your assumption "out in the market", where microsoft languages are a lot more popular due to their legacy.
|
infinity21
Canada6683 Posts
On December 17 2011 02:32 MisterD wrote:Show nested quote +On December 16 2011 04:49 infinity21 wrote:1) Sort 2) Remove duplicates 3) Add a column with the function =COUNTIF(A:A,A2) where 2 is the row number 4) Filter for the new column for values >=2 If there's entries for 3+ phone numbers, you're going to have to append the numbers to a different column If you're lucky and there's only a few rows with >=2 phone #s, then you can do it manually without having to code in VBA. On December 16 2011 04:45 MisterD wrote:set up a mysql database with phpmyadmin, export table as csv, import table in phpmyadmin, do your modifications through SQL queries, export as csv, import to excel sheet. If you are familiar with mysql/phpmyadmin, that's probably the easiest way if it's just a one time deal. If either of these conditions is false though, ignore this post  I don't think someone who doesn't know VBA would know SQL lol well -- i do? the "new gen" university software developers around here are all java grown. with the advancement of mac and linux on personal use computers, there's no point in even trying to use a microsoft language during education. SQL however is pretty much the first thing in any database lecture. so there's a whole bunch of people who are this way. though you are probably correct with your assumption "out in the market", where microsoft languages are a lot more popular due to their legacy. Yeah I was referring more to the workplace. People who knows SQL are business analysts or database administrators (generalizing here obv) so they usually have some programming skills like Java, C++, etc. And imo if you know how to program in either one of those two, you can pick up VBA in a couple of hours which has been the case for me at least. Not to say that there are zero jobs that require SQL and no VBA, but those are more of the exceptions.
|
|
|
|