Data cleaning is an important part of working with text in Python. A common problem is extra spaces: two strings can look alike on screen yet differ as data, which leads to failed comparisons and lookups. For example:

   The   fox jumped   over    the log.   

and

The fox jumped over the log.

may look similar, but technically they are different strings. In this article we clean a string by removing multiple spaces in two ways:

  • Pythonic code with split and join
    • out = " ".join(my_str.split())
  • using a regular expression
    • out1 = re.sub(r"\s\s+", " ", my_str)
    • out = re.sub(' +', ' ', my_str)

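To see why the two example sentences are technically different, here is a minimal sketch that compares them directly. The variable names `messy` and `clean` are made up for this illustration.

```python
# Two strings that look alike when read, but differ as data.
messy = "   The   fox jumped   over    the log.   "
clean = "The fox jumped over the log."

# A plain equality check fails because of the extra spaces.
print(messy == clean)   # False

# repr() makes the leading, trailing and repeated spaces visible.
print(repr(messy))

# After normalizing the spaces, the two strings compare equal.
normalized = " ".join(messy.split())
print(normalized == clean)   # True
```

Printing with repr() is a handy way to spot stray whitespace that a normal print hides.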
If you want to see how to compare strings case-insensitively, or how to build a unique list (comparing case-insensitively), then you can check:

Python unique list case insensitive

Remove consecutive spaces with split and join

With split and join we also remove the leading and trailing spaces, which is the main difference from the next approach. The technique works as follows:

  • first we split the string on whitespace
    • called with no arguments, split uses any whitespace as the separator
    • consecutive whitespace characters are treated as a single separator
  • then the resulting words are collected in a list
  • finally we join the list with a single space
  • Note that not only spaces are removed but also:
    • all other whitespace characters: tab, newline, return, formfeed

my_str = "   The   fox jumped       \n over    the log.   "
# split() with no arguments drops all whitespace; join puts single spaces back
out = " ".join(my_str.split())
print(out)

result:

The fox jumped over the log.
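The difference between calling split with no arguments and calling it with an explicit space is easy to miss, so here is a small sketch of both. The sample string `text` is an assumption chosen for this illustration.

```python
text = "a  b \t c\n"

# With an explicit separator every single space is a delimiter,
# so consecutive spaces produce empty strings and the tab and newline stay.
print(text.split(" "))   # ['a', '', 'b', '\t', 'c\n']

# With no argument, any run of whitespace acts as one delimiter
# and leading or trailing whitespace is ignored.
print(text.split())      # ['a', 'b', 'c']

print(" ".join(text.split()))   # a b c
```

This is why the cleanup relies on split() without arguments: split(" ") followed by join would simply rebuild the original string.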

Regular expression to remove extra spaces from string

The next way of solving the same problem is a bit different. Instead of extracting the words from the string, we search for runs of extra spaces and replace each run with a single space. This approach has some advantages and disadvantages:

  • Note that only space characters are collapsed:
    • tab, newline, return, formfeed are not replaced
  • Leading and trailing spaces are kept (each run is reduced to one space)
  • You can control how many characters are removed
  • It gives better control and tuning for special cases

import re

my_str = "   The   fox jumped       \n over    the log.   "
# ' +' matches one or more space characters only
out = re.sub(' +', ' ', my_str)
print(out)

result:

 The fox jumped 
 over the log. 
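As a sketch of the "control how many characters" point, a quantifier such as {2,} or {3,} sets the minimum run length that gets replaced. The sample string below is an assumption for illustration; the tabs in it show that a pattern made of a literal space leaves other whitespace alone.

```python
import re

sample = "a b  c   d\t\te"

# ' +' rewrites every run of one or more spaces; the tabs are untouched.
one_or_more = re.sub(" +", " ", sample)
print(repr(one_or_more))     # 'a b c d\t\te'

# ' {2,}' only matches runs of two or more spaces; same result here.
two_or_more = re.sub(" {2,}", " ", sample)
print(repr(two_or_more))     # 'a b c d\t\te'

# ' {3,}' leaves double spaces as they are and collapses longer runs.
three_or_more = re.sub(" {3,}", " ", sample)
print(repr(three_or_more))   # 'a b  c d\t\te'
```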

As you can see, the newline is preserved. The pattern \s\s+ matches any run of two or more whitespace characters, so the newline in this example, which sits inside such a run, is replaced as well:

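import re

my_str = "   The   fox jumped       \n over    the log.   "

# raw strings keep the backslash in \s from being treated as a string escape
out1 = re.sub(r"\s\s+" , " ", my_str)
# same pattern and arguments, so the result is identical to out1
out2 = re.sub(r"\s\s+", " ", my_str)
print(out1)
print(out2)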

result (both calls give the same string, with one leading and one trailing space):

 The fox jumped over the log. 
 The fox jumped over the log. 
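One detail worth checking is that \s\s+ needs at least two whitespace characters, so a newline standing on its own is not replaced. The sketch below contrasts it with \s+ followed by strip(). The sample string is an assumption for illustration, not taken from the example above.

```python
import re

my_str = "  one\ntwo   three\t\tfour  "

# r"\s\s+" needs two or more whitespace characters, so the lone "\n" survives.
two_plus = re.sub(r"\s\s+", " ", my_str)
print(repr(two_plus))   # ' one\ntwo three four '

# r"\s+" also replaces single whitespace characters; strip() trims the ends.
one_plus = re.sub(r"\s+", " ", my_str).strip()
print(repr(one_plus))   # 'one two three four'

# For this input the result matches the split and join approach.
print(one_plus == " ".join(my_str.split()))   # True
```

Which pattern fits depends on whether single newlines and tabs should be kept or flattened into spaces.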