Reason
The culprit is the emoji in the tweets. Many emoji are encoded as four-byte sequences in UTF-8.
MySQL's character set named utf8 uses a maximum of three bytes per character and contains only BMP (Basic Multilingual Plane) characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character and supports supplementary characters. http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
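To check whether a given string would be affected, one option is to test it for code points above U+FFFF and to look at its byte length. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" escapes); the function name needs_utf8mb4 is made up for illustration.

```php
<?php
// Hypothetical helper: true if $str contains at least one character
// outside the BMP, i.e. one that takes four bytes in UTF-8.
function needs_utf8mb4(string $str): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $str) === 1;
}

// strlen() counts bytes, not characters.
$samples = [
    'U+0041'  => "A",          // basic Latin
    'U+20AC'  => "\u{20AC}",   // euro sign, still in the BMP
    'U+1F600' => "\u{1F600}",  // grinning face emoji, supplementary plane
];

foreach ($samples as $label => $char) {
    printf(
        "%s: %d byte(s), needs utf8mb4: %s\n",
        $label,
        strlen($char),
        needs_utf8mb4($char) ? 'yes' : 'no'
    );
}
```

Only the last sample takes four bytes, which is why it cannot be stored in a utf8 column.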
Proposed solutions
Solution 1
Strip from the string every character that needs more than 3 bytes in UTF-8.
Solution 2
Use MySQL 5.5.3 or later and change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.
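For Solution 2, the conversion could look roughly like the sketch below. It only builds the PDO DSN and the ALTER TABLE statement; the helper names, the table name tweets and the database name tweets_db are assumptions for illustration, not taken from the original setup. Note that the connection character set should be utf8mb4 as well, otherwise four-byte characters may still be lost on the way to the server.

```php
<?php
// Hypothetical helper: a PDO DSN that asks for utf8mb4 on the connection.
function utf8mb4_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Hypothetical helper: SQL that converts every character column of a
// table to utf8mb4. The table name is validated because identifiers
// cannot be bound as query parameters.
function convert_table_sql(string $table): string
{
    if (!preg_match('/^[A-Za-z0-9_]+$/', $table)) {
        throw new InvalidArgumentException("Unexpected table name: $table");
    }
    return "ALTER TABLE `$table` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
}

// Usage with assumed names and credentials:
// $pdo = new PDO(utf8mb4_dsn('localhost', 'tweets_db'), $user, $password);
// $pdo->exec(convert_table_sql('tweets'));
echo convert_table_sql('tweets'), "\n";
```

On older MySQL versions, indexed VARCHAR columns may run into the index key length limit after the conversion, so it is worth trying this on a copy of the table first.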
Solution 1 (http://stackoverflow.com/questions/8491431/remove-4-byte-characters-from-a-utf-8-string):
Since 4-byte UTF-8 sequences always start with a byte in the range 0xF0-0xF7 (only 0xF0-0xF4 occur in valid UTF-8), the following should work:
$str = preg_replace('/[\xF0-\xF7][\x00-\xFF]{3}/s', '', $str);
Alternatively, you could use preg_replace in UTF-8 mode, but this will probably be slower:
$str = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
This works because 4-byte UTF-8 sequences are used for code points in the supplementary Unicode planes, which start at 0x10000.
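The two patterns can be combined into one helper. This is a minimal sketch, assuming PHP 7 or later; the function name strip_four_byte_chars and the sample tweet are made up for illustration. It tries the UTF-8 mode pattern first and falls back to the byte-oriented pattern when the input is not valid UTF-8.

```php
<?php
// Hypothetical helper: remove every character that takes four bytes in UTF-8.
function strip_four_byte_chars(string $str): string
{
    // UTF-8 mode: matches code points in the supplementary planes.
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $str);
    if ($clean !== null) {
        return $clean;
    }
    // preg_replace() returns null on error, e.g. when $str is not valid
    // UTF-8. Fall back to the byte-oriented pattern, which has no such
    // requirement; keep the input unchanged if that fails too.
    return preg_replace('/[\xF0-\xF7][\x00-\xFF]{3}/s', '', $str) ?? $str;
}

$tweet = "Good morning \u{1F600}!";
echo strip_four_byte_chars($tweet), "\n"; // prints "Good morning !"
```

Characters of up to three bytes, such as accented letters or the euro sign, pass through unchanged.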