Entries Tagged as ''

How to optimize PostgreSQL database size

PostgreSQL is a powerful, open source relational database system. It has more than 15 years of active development and a proven architecture that has earned it a strong reputation for reliability, data integrity, and correctness.

One of the PostgreSQL’s most sophisticated features is so called Multi-Version Concurrency Control (MVCC), a standard technique for avoiding conflicts between reads and writes of the same object in database. MVCC guarantees that each transaction sees a consistent view of the database by reading non-current data for objects modified by concurrent transactions. Thanks to MVCC, PostgreSQL has great scalability, a robust hot backup tool and many other nice features comparable to the most advanced commercial databases.

Unfortunately, there is one downside to MVCC, the databases tend to grow over time and sometimes it can be a problem. In recent versions of PostgreSQL there is a separate server process called the autovacuum daemon (pg_autovacuum), whose purpose is to keep the database size reasonable. It does that by trying to recover reusable chunks of the database files. Still, there are many scenarios that will force the database to grow, even if the amount of the useful data in it doesn’t really change. That happens typically if you have lots of UPDATE and/or DELETE statements in the applications that are using the database.

Mechanisms for recovering used space are well documented, and I decided to do a little test to see how much space can actually be recovered. Unfortunately, all of the procedures you’ll find in this article demand that database is taken offline during their run. But, that’s the price you’ll have to pay to get your disk space back.

Test setup

For this interesting test I have used a real database that was taken from a great commercial antispam/antivirus product. Original size of the database was 5876 MB. The database was running for few weeks before I decided to take its database for “academic” purposes. The tests were run on a quite old (dual PIII), but still capable machine (mostly thanks to a 6 SCSI 10krpm disk RAID6 array). The PostgreSQL database under test was a slightly older one, version 7.4.8, because that is what the product is shipping currently. The pg_autovacuum daemon was running all the time while the database was in production (keeping metadata information about lots of quarantined spam and/or virus email messages, you probably wanted to know :) ).

Vacuuming

Standard procedure for getting back the space is the VACUUM FULL command. While plain VACUUM (without FULL) simply reclaims space and makes it available for re-use, VACUUM FULL does more extensive processing, including compacting the database files to the minimum number of disk blocks. The documentation clearly says that this form is much slower and requires an exclusive lock on each table while it is being processed.

So, running it in the form of “vacuumdb -afvz” command, which covered the whole database and also generated some useful statistics, revealed that it indeed is a slow process. It took full 3 hours for the command to finish. The resulting database was occupying 4278 MB database (a 27% reduction in database size). During the run, vacuumdb was heavy on both CPU and disk.

Reindexing

Next recommended procedure is reindexing the whole database, because indexes can be even tougher when it comes to reusing the old blocks. I was scared that reindexing would took more time than vacuuming, but it finished in only 9 minutes. What was even more surprising is that reindexing the database (using reindexdb command) saved another 39% of the initial database size!

It looks like reindexdb is much faster and more efficient about getting the disk space back, at least on this test database. The final database size was 1990 MB, only 34% of what we started with. I hope I don’t need to remind that the database still holds the same data, it’s just physical disk usage that has dropped three times after slightly more than 3 hours of vacuuming and reindexing. Reindexing was mostly CPU bound, so it would be even faster on a modern machine with faster processors.

Take 2: dump and restore

Then I decided to do a completely different test. I reverted the test database to the initial state, full 5876 MB, and then proceeded to dump it, with the idea of clean restore after that, of course. That is a standard procedure when you’re upgrading a PostgreSQL database from one major version to another (for example: 7.4 to 8.0). Dumping the database (with a standard pg_dump command) took only 3 and half minutes, which came as a pleasant surprise. After that I dropped the database (with dropdb) and recreated it (createdb) with the same encoding and owner and used the standard psql utility to import the data back which took another 16 minutes.

If you compare this times to the former ones, you’ll see that this procedure took only around 20 minutes which is almost 10 times faster! But, best part is yet to come. The final database size was only 1685 MB which is less than what we achieved with the vacuumdb/reindexdb procedure. Once again, the useful data in the database is identical to the data we initially had.

Conclusion

If you have read all the details above closely, you now know that standard PostgreSQL recipe of running VACUUM FULL in regular intervals (typically overnight or on weekends) might not be the best choice. If reloading database (dump/restore procedure) does the job 10 faster and with a better final result then maybe that is what you should do, no matter how ugly it looks. But, of course, before you make final decision, you should run tests on your own database to see what’s best for you.

Also it can be concluded that the PostgreSQL database could use some improvements in this specific area of disk space management. I know that disk space is quite cheap nowadays, but I’m also pretty sure that oversized database is not only taking more space, but is also somewhat slower because it has to keep track about more data on disk (think disk caches, RAM…).

In the end, it would be interesting to hear what have other people observed on various database and hardware setups out there.

linuxinsight.com

IBM/University Collaboration To Develop Open-Source Accessibility Tools

University researchers in the United States and the U.K. have started collaborating with IBM to develop open-source software solutions aimed at bringing accessibility to older workers to “help them adapt to and remain productive in the changing workplace of the 21st century,” according to information released late last week by IBM. The effort is part of IBM’s Open Collaborative Research initiative, which creates partnerships with universities for the purpose of developing and releasing open-source tools to the public.

The new partnership brings together researchers from the University of Dundee’s School of Computing in the U.K. and the University of Miami’s Miller School of Medicine in a concerted effort to develop assistive technologies and gain insight into new ways to improve workplace performance in maturing workers by improving the workplace itself and by increasing the ability of older workers to use technology.

“Older workers represent an extremely valuable resource. However they need to have tools available to them to be able to compete in today’s technology driven workplace,” said Sara J. Czaja, a professor at the University of Miami Miller School of Medicine, in a statement released Friday. “We are excited about the opportunity to collaborate with IBM as it provides us with a unique opportunity to have the results of our research efforts translated into actual products and tools. It also provides a wonderful opportunity for our students.”

Miami’s research efforts will focus on technology work settings, with an emphasis on developing e-learning tools targeting older workers. The work will build on research from the Center for Research and Education on Aging and Technology Enhancement (CREATE) and the Center on Aging. Dundee will focus on producing visualizations and interaction scenarios.

The results of the project, as with all IBM OCR projects, will be “made available as open source software code, and all additional intellectual property developed based on those results will be openly published or made available royalty-free,” according to IBM.

By David Nagel

Read More:

FBI director targets the Internet’s top dangers

The Internet’s dark side and the growing threat of nefarious attack via the ‘Net were the central themes of a pointed cyber security speech by the Federal Bureau of Investigation’s Director Robert Mueller yesterday at The Pennsylvania State University.

“If we lose the Internet, we do not simply lose the ability to e-mail or to surf the Web. We lose access to our data. We lose our connectivity. We lose our intellectual property. We lose our security. What happens when the so-called ‘Invisible Man’ locks us out of our own homes, our offices, and our information?” Mueller said. “The threat is not limited to hackers on the outside. Insiders present a significant problem. Contractors may take the appropriate security measures, but what about those with whom they subcontract and their subs? And what of those who take advantage of open access to research and development facilities on campuses such as this?”

The FBI’s chief made a number of dark points:

Terror and the Web: “Take the case of Younis Tsouli, the self-styled “Terrorist 007″ who not only served as an al Qaeda webmaster but also hacked into servers to get additional bandwidth, used phishing schemes to steal credit card accounts and buy $3 million worth of terrorist equipment, and created a website “that he hoped would become the YouTube for terrorists” called “You bomb it.” Could you fall for a scam or run a server that could end up helping terrorists?”

Estonia: “The Internet is not only the means by which attacks may be planned and executed, it is a target in and of itself. Last April, Estonia suffered what has been called a “cyber blockade.” Wave after wave of data requests from computers around the world shut down banks and emergency phone lines, gas stations and grocery stores, newspapers and television stations, even the prime minister’s office. Although the source of this attack has not been confirmed, the effect was real, and left all of us aware of the potential risk we face. How long before others around the world begin to employ similar tactics?”

BotNets:Botnets are networks of computers taken over by hackers-usually without their owners’ knowledge. Once under their thumbs, these networks can wreak all kinds of havoc, from shutting down a power grid to flooding an emergency call center with millions of spam messages.”

The invisible man: “Hackers are using sophisticated techniques to steal sensitive intelligence, scientific research, and communications data. They are difficult to identify and track because they move in and out of international systems at will, and they do not leave broken glass behind. A member of our cyber team describes it as having an invisible man in the room, standing over your shoulder, seeing and hearing everything you do, watching every word you type. And you may never know he is there…who he represents…or how much damage he has done.”

On the brighter side, Mueller said there is a growing army of specialists that can counter the myriad online threats. For example:

Threat defense: “Officers, agents, and IT specialists in our Regional Computer Forensic Labs find and examine digital evidence from e-mail and cell phone data to documents on hard drives. Together, we continue to break new ground in the investigation and prosecution of cyber criminals. But we cannot limit our operations to the United States. Increasingly, cyber threats originate outside of our borders. And as more people around the world gain access to computer technology, new dangers will surface. For this reason, global cooperation is vital. We have 60 Legal Attaché offices around the world. We are working with our partners in Romania, Russia, Poland, Hungary, Italy, and Estonia, among others, to investigate international cyber threats.”

Cyber Fusion Center: “Much of our collaboration begins in Pittsburgh-at the FBI’s Cyber Fusion Center. Think of the fusion center as a hub, with spokes that range from federal agencies, software companies, and ISPs, to merchants and members of the financial sector. Industry experts from companies such as Cisco, Bank of America, and Target sit side-by-side with the FBI, postal inspectors, the Federal Trade Commission, and many others, sharing information and ideas. Together, we have created a neutral space where cyber experts and competitors, who might not otherwise collaborate, can talk about cyber threats and security breaches.”

The FBI’s InfraGard program: “A more localized example of our private sector partnerships. Members from a host of industries, from computer security to the chemical sector, share information about threats to their own companies, in their own communities, through a secure computer server. To date, there are nearly 21,000 members of InfraGard, from Fortune 500 companies to small businesses. That amounts to 21,000 partners in our mission to protect America.”

Mueller wasn’t all dark tales however. “Two weeks ago, in the middle of the World Series, the Colorado Rockies suffered a denial of service attack-just minutes after tickets went on sale for the Rockies’ home games against the Red Sox. Thousands of fans were unable to buy tickets-fans who were ultimately spared the spectacle of witnessing a clean sweep.I reference this case because it highlights our dependence on computer technology and the seriousness of the cyber threat. But it also gives me one more excuse to remind everyone that the Red Sox won the World Series…again.”
networkworld.com

FreeRice

Web game provides rice for hungry

WFP food aid being distributed in Somalia - 26/09/2007

The WFP says hunger is the world’s number one emergency

An internet word game has generated enough rice to feed 50,000 people for one day, the UN’s World Food Programme (WFP) has said. The game, FreeRice, tests the vocabulary of participants. For each click on a correct answer, the website donates money to buy 10 grains of rice.

Companies advertising on the website provide the money to the WFP to buy and distribute the rice.

FreeRice went online in early October and has now raised 1bn grains of rice.

That is enough rice to feed 50,000 people for one day, the WFP said on Friday.

http://www.freerice.com

Hacker to Hacker Conference – H2HC

http://www.h2hc.org.br
Repository
Capture the Flag 2007

Google search tip: site, define, filetype

Search engines have gotten so good that I sometimes forget that they offer fantastically useful syntax for more specific results. I’ll point these out from time to time, but here are my top three most useful Google search modifiers that use a colon:

site:URL and search term. As in, site:www.pcmag.com “wireless router.” Insiders point out that this modifier is even stronger if you drop the www. You can also drop the domain name entirely and search, for example, only .gov sites.

define:word. This brings up definitions, related phrases, and offers to translate the word. Example: define:network.

filetype:file extension and search term. It may be obvious, but this lets you search for files with a certain extension, such as ISO for an image. Example:

Java Mozdev

The Java Mozdev project provides XPIs for easy installation of Java within Mozilla.

More Information

Linux – What Java do I Need

Blackdown JRE 1.4.x