• 0

[SQL] Compare Addresses


Question

Hello gang,

We are having an issue where we need to compare some address data to insure that we are not entering a duplicate record. (The data is coming from multiple outside sources, so an identity/GUID won't work... but thanks for thinking of that)

Consider the following

Company Name

Address

City

State

ZIP Code

If the first person was to send us:

"Bob's Restaurant"

"123 Main Street"

"New York"

"NY"

"10018"

and the second person sends us:

"Bobs Restaurant"

"123 Main St."

"New York"

"NY"

"10018"

Notice that the second restaurant does not have the appostrophy and the street name is abreviated. So an exact match is SQL would not work.

We're looking at Latitude and Longitude, and that should get us there, but in case anyone has any other thoughts..... thanks.

Link to comment
https://www.neowin.net/forum/topic/719338-sql-compare-addresses/
Share on other sites

8 answers to this question

Recommended Posts

  • 0

I'm not sure what database server you are using, but how about something like soundex()? If you can compile a list of common ways to abbreviate parts of an address, and could be reasonably sure that all occurances of St. really equals Street (I'm not sure how your data is stored, if you have city in one column, state in one column, street address in one column, name of location in one column, etc. then this would probably doable), then this would work for you:

DROP TEMPORARY TABLE IF EXISTS t1;
CREATE TEMPORARY TABLE t1 
(
  address text
);

INSERT INTO t1 (address) 
VALUES 
('Bob\'s Restaurant 123 Main Street New York NY 10018'),
('Bobs Restaurant 123 Main St. New York NY 10018');

UPDATE t1 SET address = REPLACE(address, 'St.', 'Street');

SELECT address, 
		   soundex(address) 
FROM t1;

This returns this data set, so, you can see that it is a duplicate address:

+----------------------------------------------------+-------------------+
| address											| soundex(address)  |
+----------------------------------------------------+-------------------+
| Bob's Restaurant 123 Main Street New York NY 10018 | B2623653523635625 |
| Bobs Restaurant 123 Main Street New York NY 10018  | B2623653523635625 |
+----------------------------------------------------+-------------------+

In practice, the way I would do this is to create a stored procedure that takes an address as a parameter or set of parameters, one param for each part of the address. Then you could run a set of replaces against the parameter to normalize it to the rest of the data already in the database (replace St. with Street, or, Street with St., whichever you want, Dr. with Drive, Ln. with Lane, etc.). So, assuming your sp took a param named p_address, and you have already ran your replaces to normalize the string, you could run a query like this:

DECLARE p_is_duplicate tinyint(1) NOT NULL DEFAULT 0;

SET p_is_duplicate = EXISTS(SELECT * FROM addresses WHERE soundex(address) = soundex(p_address);

IF NOT p_is_duplicate THEN 
  INSERT INTO address SET address = p_address;
END IF;

Hope this helps, latitude and longitude will probably be more accurate in the long run, but something like this may work for you depending on what dbms you use and how your data is stored. :)

  • 0

UPDATE:

There is an issue with SoundEx where is does not acurately compare. The two examples below both return values that state they are the same even though they are not (different city, state and zip)

I understand that I can break out the street while doing my compare, and I may do that, I simply wanted to alert other developers of this issue.

SELECT SOUNDEX ('Bob 500 1st St. Seattle, WA 98109'), SOUNDEX ('Bobby 500 First St New York, NY')

SELECT DIFFERENCE ('Bob 500 1st St, Seattle 98109', 'Bobby 500 First St New York, NY 10036')

I may actually may use a DLL that allows me to create lat/long and compare these values....

Thanks again for the feedback and idea.

  • 0

I knew there was another SoundEx example that I wanted to mention:

SELECT SOUNDEX ('500 First Street'), SOUNDEX ('400 First Street')

Both of these come back as a match. So be aware that soundex is prob not the best solution for address compare. I've moved on to (attempting) to work with lat/long. (It's the getting of these values that is killing me)

Have a great weekend.

  • 0
I'm not sure what database server you are using, but how about something like soundex()? If you can compile a list of common ways to abbreviate parts of an address, and could be reasonably sure that all occurances of St. really equals Street (I'm not sure how your data is stored, if you have city in one column, state in one column, street address in one column, name of location in one column, etc. then this would probably doable), then this would work for you:

DROP TEMPORARY TABLE IF EXISTS t1;
CREATE TEMPORARY TABLE t1 
(
  address text
);

INSERT INTO t1 (address) 
VALUES 
('Bob\'s Restaurant 123 Main Street New York NY 10018'),
('Bobs Restaurant 123 Main St. New York NY 10018');

UPDATE t1 SET address = REPLACE(address, 'St.', 'Street');

SELECT address, 
		   soundex(address) 
FROM t1;

This returns this data set, so, you can see that it is a duplicate address:

+----------------------------------------------------+-------------------+
| address											| soundex(address)  |
+----------------------------------------------------+-------------------+
| Bob's Restaurant 123 Main Street New York NY 10018 | B2623653523635625 |
| Bobs Restaurant 123 Main Street New York NY 10018  | B2623653523635625 |
+----------------------------------------------------+-------------------+

In practice, the way I would do this is to create a stored procedure that takes an address as a parameter or set of parameters, one param for each part of the address. Then you could run a set of replaces against the parameter to normalize it to the rest of the data already in the database (replace St. with Street, or, Street with St., whichever you want, Dr. with Drive, Ln. with Lane, etc.). So, assuming your sp took a param named p_address, and you have already ran your replaces to normalize the string, you could run a query like this:

DECLARE p_is_duplicate tinyint(1) NOT NULL DEFAULT 0;

SET p_is_duplicate = EXISTS(SELECT * FROM addresses WHERE soundex(address) = soundex(p_address);

IF NOT p_is_duplicate THEN 
  INSERT INTO address SET address = p_address;
END IF;

Hope this helps, latitude and longitude will probably be more accurate in the long run, but something like this may work for you depending on what dbms you use and how your data is stored. :)

Soundex performs a match based on the pronumciation of the first four constanants it encounters (I had to look this up before because I ran into a similar situation). So "Bob's R" and "Bobs R" would be used for comparison and thus match.

It's also a good habit to ALWAYS replace input quotation marks and apostrophes with "`" or "``". You're open to SQL injection attacks if you don't.

  • 0

Try using a CHECKSUM instead:

SELECT * FROM ADDRESS WHERE CHECKSUM(company + address + city + state + zip) = CHECKSUM(@company + @address + @city + @state + @zip)

Be sure to encase any fields with an isnull() to default the value out to '' if any of those fields can be null. EG

SELECT * FROM ADDRESS WHERE CHECKSUM(isnull(company, '') + isnull(address, '') + isnull(city, '') + isnull(state, '') + isnull(zip, '')) = CHECKSUM(@company + @address + @city + @state + @zip)

If the query returns any rows, you have a duplicate.

This topic is now closed to further replies.
  • Recently Browsing   0 members

    • No registered users viewing this page.
  • Posts

    • Google is opening the world's first AI museum in Los Angeles by Ivan Jenic Image via: Google Ever since AI image generators went mainstream, the debate over whether AI-generated art is real art hasn't let up. Those who don’t consider AI to be art say that if a machine does the creating and anyone can prompt it, there’s no skill involved, and therefore no art is produced. The counter-argument is equally persistent, as defenders of AI-generated artworks often claim that AI is just a tool, and that every major technological breakthrough, like the camera or the computer, was met with the same skepticism before eventually being accepted as a legitimate creative medium. Google’s position in this debate is clear. Which is no surprise, as the company is investing billions in AI infrastructure. And now, in efforts to encourage people to use its AI even more, Google is opening Dataland on June 20, which it's calling the world's first AI arts museum. Located inside The Grand LA, a Frank Gehry-designed building in Los Angeles, the museum spans 25,000 square feet. The museum is built around a collaboration with media artist Refik Anadol, who has worked with Google since 2016. The inaugural exhibition is called Machine Dreams: Rainforest, and is powered by an AI model trained on “an extensive dataset of the natural world.” It generates 1.2 billion pixels of visuals in real time and reacts to visitors dynamically. The space also generates soundscapes, real-time emotion sensing, and algorithmically produced scents. Image via: Refik Anadol Studio / Google Google says that the museum is powered by its Gemini models, which run on Google Cloud. So, everything is generated inside one of Google’s AI data centers and is streamed to the museum. Alongside the museum opening, Google Arts & Culture is funding an AI Artist Residency, giving four artists $25,000 grants each, along with mentorship from Refik Anadol Studio and access to Google's machine learning tools. Their work will be shown at Dataland and on the Google Arts & Culture website later this year. Google’s AI museum will undoubtedly initiate a fired-up debate on social media, and we can’t wait to see the first reactions. Via: Smithsonian Magazine
    • Calling GTA 6 overhyped crap doesn’t make you edgy, it just makes you sound like someone who hasn’t enjoyed anything since the PS2 era.
    • I’m not arguing whether Rockstar likes money. Obviously, they do, they’re a business. I’m saying this isn’t new. They’ve always launched console first. This is just how Rockstar operates.
    • I'm not sure how old the school is, but they've been doing this since GTA 3. Back in those days we'd be lucky for game companies to release on the PC at all. And with the current state of Sony (or Microsoft) their gaming wing won't be getting a penny from me.
    • We now know when and how the Universe may truly end by Sayan Sen Image by Marek Pavlík via Pexels| Not representative A study by physicist Henry Tye of Cornell University suggests that the universe may not expand forever. Instead, it could eventually stop expanding, begin contracting and end in a "Big Crunch" roughly 20 billion years from now. The research, published in the Journal of Cosmology and Astroparticle Physics, was conducted by Tye, Horace White Professor of Physics Emeritus at Cornell University. Using recent observations from major dark-energy surveys, Tye and his collaborators developed a cosmological model that predicts the universe could have a total lifespan of about 33 billion years. Since the universe is currently estimated to be 13.8 billion years old, the model places it near the midpoint of its existence. According to Cornell University's summary of the research, the study centers on the cosmological constant, a term introduced by Albert Einstein in his theory of general relativity. In modern cosmology, the cosmological constant is commonly used to describe the simplest form of dark energy, the unknown phenomenon believed to be driving the accelerating expansion of the universe. "For the last 20 years, people believed that the cosmological constant is positive, and the universe will expand forever," Tye said in a Cornell University news release. "The new data seem to indicate that the cosmological constant is negative, and that the universe will end in a big crunch." The study draws on data from the Dark Energy Survey (DES) and the Dark Energy Spectroscopic Instrument (DESI), two major projects designed to investigate the nature of dark energy. According to Tye, recent observations suggest that dark energy may not behave exactly like a simple cosmological constant. To account for those observations, Tye and his collaborators proposed a model involving an extremely light hypothetical particle that evolves over time. In their calculations, this produces a negative cosmological constant and leads to a future collapse of the universe. The model predicts that cosmic expansion would continue for approximately another 11 billion years before reaching a maximum size, after which the universe would begin contracting and eventually collapse. Scientists have long debated how the universe might end. As explained in an article published in The Conversation by Stephen DiKerby of Michigan State University, several possibilities have been proposed. If dark energy remains constant and positive, the universe could continue expanding indefinitely, gradually becoming colder, darker and more diffuse in a scenario often called the "heat death" of the universe. Other theoretical possibilities include a Big Rip, in which cosmic expansion accelerates so dramatically that galaxies, stars and even atoms are torn apart, or a Big Crunch, in which expansion reverses and the universe collapses back into an extremely dense state. DiKerby notes that the Big Crunch idea itself is not new. What distinguishes Tye's work is that it attempts to use current observational data to estimate when such a collapse might occur and how it could unfold. Much of the universe's long-term evolution remains uncertain. According to current astrophysical understanding, stars will continue to form and die for billions of years. The Sun, for example, is about halfway through its expected lifespan. Galaxies are also expected to continue merging; the Milky Way and Andromeda galaxies are projected to collide several billion years from now. At the same time, the nature of dark energy remains one of the biggest unanswered questions in cosmology. While observations indicate that the universe's expansion is accelerating, scientists still do not know what is causing that acceleration. Future observations may therefore alter current predictions about the cosmos's ultimate fate. Tye emphasized that additional evidence will be needed before firm conclusions can be drawn. DESI continues to collect data, while upcoming observations from missions and observatories including Euclid, SPHEREx and the Vera C. Rubin Observatory are expected to provide more precise measurements of dark energy. "People have said before that if the cosmological constant is negative, then the universe will collapse eventually. That's not new," Tye said. "However, here the model tells you when the universe collapses and how it collapses." For now, the study presents one possible future for the cosmos rather than a settled prediction. Whether the universe ultimately ends in a Big Crunch, expands forever, or follows another path entirely remains an open question that future observations will help answer. Source: Cornell University, The Conversation This article was generated with some help from AI and reviewed by an editor. Under Section 107 of the Copyright Act 1976, this material is used for the purpose of news reporting. Fair use is a use permitted by copyright statute that might otherwise be infringing.
  • Recent Achievements

    • First Post
      AndreaB earned a badge
      First Post
    • Week One Done
      Huge Trailer earned a badge
      Week One Done
    • Week One Done
      Classifyskilleducation earned a badge
      Week One Done
    • One Month Later
      eurospharma62 earned a badge
      One Month Later
    • Week One Done
      With What earned a badge
      Week One Done
  • Popular Contributors

    1. 1
      +primortal
      570
    2. 2
      +Edouard
      175
    3. 3
      PsYcHoKiLLa
      73
    4. 4
      Michael Scrip
      68
    5. 5
      neufuse
      64
  • Tell a friend

    Love Neowin? Tell a friend!