Hex, Bugs and More Physics | Emre S. Tasci

a blog about physics, computation, computational physics and materials…

Creating Booklets from PDF files

May 29, 2008 Posted by Emre S. Tasci

Some printers can automatically generate booklets from the sequentially ordered PDF files you send to them. Most (say, 95%) can't, and if your printer is indeed one of that majority, you can definitely benefit from this script, which converts your PDF files into booklets. The pages are shrunk to A5, and all you need to do to get your booklet is fold the output down the middle. I found this script via the Maemst Blog, which points to the PDF/PS hacks page on Pro-Linux and to Michael Roessler's script, which I'm quoting as follows:

sururi@dutsm0175 tmp $ cat booklet.sh
#!/bin/bash
#
# call with file.pdf
# (variables are quoted so that paths containing spaces also work)
file="$1"
filebase=$(basename "$file" .pdf)
# convert the PDF to PostScript
pdftops "$file" output.ps
# reorder the pages into booklet (signature) order
psbook output.ps tmp.ps
# place two pages per A4 sheet, scaled to 70% and rotated
pstops "4:0L@.7(21cm,0)+1L@.7(21cm,14.85cm),2R@.7(0,29.7cm)+3R@.7(0,14.85cm)" tmp.ps > "${filebase}-booklet.ps"
rm -f output.ps tmp.ps
echo "Converting back to pdf ..."
ps2pdf "${filebase}-booklet.ps"
rm -f "${filebase}-booklet.ps"
sururi@dutsm0175 tmp $

The script as originally posted has problems when your path includes spaces (even if you escape them); quoting the variables, as done above, works around this. Also, the first page may come out on the wrong side (and inverted). But these are really minor annoyances compared to what the script achieves.

Usage is pretty simple:

booklet.sh <input_file_name.pdf>

and the output file will be input_file_name-booklet.pdf


Btw, this will be the first post under the new category Tools.

The Ghost in the Machine

May 16, 2008 Posted by Emre S. Tasci

Yesterday, during Alain P.'s presentation titled "From Quantum to Atomistic Simulations", there was an argument over one of his comments on the generation of potentials, particularly EAM potentials. He listed "No physical meaning ('Any function is good as long as it works')" as one of the specifications (or, more accurately, as a freedom), which drew objections from the audience: they -naturally- insisted that, when dealing with a physical problem, we ought to keep the physical aspects and reasoning in mind, and that blindly fitting some parameters to the data is surely not the correct procedure to follow.

This reminded me of a debate program we used to have back in Turkey called Siyaset Meydanı ("Political Grounds", maybe, in translation). Most of the time the participants would defend two opposing but, on their own terms, equally defensible opinions, for both of which you could easily find supporters. But in one of the sessions they were discussing smoking. As you can guess, nobody would really defend smoking; the session was intended as a "light" one, and for that reason the party in favor of smoking was made up of artists and showmen. At first it was like a sham fight, because everybody already knew which side would win. But then one of the anti-smokers, who happened to be the chairwoman of some anti-x organization, stood up, and within minutes she was defending the hardest cause of all: smoking kills you, kills your loved ones, kills the children, kills the world… Her attitude provoked those who were initially "in favor of smoking" only on paper, and they began actually defending their cause. Since they were popular artists and showmen, meaning they knew how to present ideas effectively, by the end of the session a strange triumph had been achieved.

Actually, the relevant part of the anecdote above is the debate being fought over a cause that seemed already won. Definitely, when doing physics, we must keep in mind that we are doing physics. But…

Before going into that "But…" and all the advances in computational tools plus information theory, I would like to include two contradicting(?) quotes on the subject. One comes from a physicist, while the other belongs to a biologist:

“A theory with mathematical beauty is more likely to be correct than an ugly one that fits some experimental data.”
Paul Dirac

“The great tragedy of science is the slaying of a beautiful hypothesis by an ugly fact.”
Thomas Huxley

Maybe -but maybe- you can pave a way between these two by introducing another physicist:
“Make everything as simple as possible, but not simpler.”
Albert Einstein

So, back to the subject: it is natural and obvious to think within the domain of the discipline you are actually researching, but it is getting more and more tempting to pull yourself back one step, look at the problem from above, and treat it as mere data. I do not mean this in a "Digitalize everything! NOW!" sense, but more as a transformation: you carry the problem into another realm where you can exploit that realm's properties and tools, tackle the problem from a different angle, and, after solving it, transform it back to where it belongs. In the "digital realm" it is possible to violate, for example, the causality principle, since -from a mathematical point of view- you have two equally valid solutions: just go on, fork the results, solve for both of them, and when you return from your trip to 101010 (42, btw), make sure you check the implications.

Can I offer a quasi-solution to this conundrum? Yes: "any function is good as long as it works", but "as long as there is no objection from the physical point of view, and also as long as there is no violation – now or ever, however unlikely it may be." Black holes and dislocations: if you happen to step on them in silico, and you didn't know about them until then, and the physicist in you tells you that there is no such thing, what do you do? The medical point of view clearly states that, in order to be successful most of the time, you should stick to the standard regimens of treatment. Yes, maybe the 5% that are exceptions will die because of this generalization, but you'll save the 95%. And I second that: you can't spend your (calculation) time trying to comprehend every anomaly. And if it is theory, again, don't worry too much… Remember Einstein and Bose? Guess who got the Nobel for that; yes, things happen (Cornell, Wieman, Ketterle)…

An overheated argument which I should have included at the beginning of this entry:
"Woe unto those who hath sold their souls to silica! For they gave up their intuition in return for some toys."

Wishing that you always manage to return to your green land after your trip to 42 (6x9_13) land,
Yours truly.

"[…] Note that these concerns have nothing to do with the importance of messages. For example, a platitude such as "Thank you; come again" takes about as long to say or write as the urgent plea, "Call an ambulance!" while clearly the latter is more important and more meaningful. Information theory, however, does not involve message importance or meaning, as these are matters of the quality of data rather than the quantity of data, the latter of which is determined solely by probabilities." (I was trying to find an already formulated essay about knowledge being stripped of its content and thus treated solely as data, but I took the easy way out and copied from Wikipedia – sorry.)

One more thing I wanted to write about, but maybe some other time: David J.C. MacKay gives a very insightful account of how Occam's razor follows naturally from the Bayesian interpretation [*]. The simpler solution is indeed preferable to the more complex one, and once this principle is formulated, one can use it in model comparison. So, if you accept as a priori that "nature will follow the simplest path among the possibilities", the algorithm can make this distinction for you. Assume that there is "no physical meaning – any function is good as long as it works", or don't assume it: eventually, with the help of the Bayesian interpretation, you will arrive at the same point regardless of your assumption (provided that nature indeed favors the simpler form 8).

[*] – MacKay D.J.C. “Probable Networks and Plausible Predictions – A Review of Practical Bayesian Methods for Supervised Neural Networks” – http://www.inference.phy.cam.ac.uk/mackay/network.ps.gz | http://www.inference.phy.cam.ac.uk/mackay/BayesNets.html
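To sketch the idea (my own summary of the evidence framework, so take the notation with a grain of salt): comparing two models $H_1$ and $H_2$ in the light of data $D$, Bayes' rule gives

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(H_1)}{P(H_2)} \cdot \frac{P(D \mid H_1)}{P(D \mid H_2)}$$

and the evidence of each model, marginalized over its parameters $w$, is approximately

$$P(D \mid H) = \int P(D \mid w, H)\, P(w \mid H)\, dw \approx P(D \mid w_{\mathrm{MP}}, H) \cdot \frac{\sigma_{w \mid D}}{\sigma_w}$$

The last factor, the "Occam factor", is the ratio of the posterior-accessible parameter volume to the prior one: a model that needs finely tuned parameters to fit the data has $\sigma_{w \mid D}$ much smaller than $\sigma_w$ and is automatically penalized, so of two models that fit equally well, the evidence favors the simpler.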

"The two ideas of neural network modelling and Bayesian statistics might seem uneasy bed-fellows. Neural networks are non-linear parallel computational devices inspired by the structure of the brain. ‘Backpropagation networks’ are able to learn, by example, to solve prediction and classification problems. Such a neural network is typically viewed as a black box which finds by hook or by crook an incomprehensible solution to a poorly understood problem. In contrast, Bayesian methods are characterized by an insistence on coherent inference based on clearly defined axioms; in Bayesian circles, and ‘ad hockery’ is a capital offense. Thus Bayesian statistics and neural networks might seem to occupy opposite extremes of the data modelling spectrum."

Probable Networks and Plausible Predictions – A Review of Practical Bayesian Methods for Supervised Neural Networks

David J.C. MacKay

Specification of an extensible and portable file format for electronic structure and crystallographic data

X. Gonze, C.-O. Almbladh, A. Cucca, D. Caliste, C. Freysoldt, M.A.L. Marques, V. Olevano, Y. Pouillon and M.J. Verstraete

Comput. Mater. Sci. (2008) doi:10.1016/j.commatsci.2008.02.023

Abstract

In order to allow different software applications, in constant evolution, to interact and exchange data, flexible file formats are needed. A file format specification for different types of content has been elaborated to allow communication of data for the software developed within the European Network of Excellence “NANOQUANTA”, focusing on first-principles calculations of materials and nanosystems. It might be used by other software as well, and is described here in detail. The format relies on the NetCDF binary input/output library, already used in many different scientific communities, that provides flexibility as well as portability across languages and platforms. Thanks to NetCDF, the content can be accessed by keywords, ensuring the file format is extensible and backward compatible.

Keywords

Electronic structure; File format standardization; Crystallographic datafiles; Density datafiles; Wavefunctions datafiles; NetCDF

PACS classification codes

61.68.+n; 63.20.dk; 71.15.Ap

Problems with the binary representations:

  1. lack of portability between big-endian and little-endian platforms, or between 32-bit and 64-bit platforms (see the sketch below);
  2. difficulties in reading files written by F77/90 codes from C/C++ software (and vice versa);
  3. lack of extensibility, as a file produced by one version of the software might not be readable by a past/forthcoming version.
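As a minimal illustration of point 1 (my own sketch, using nothing beyond the Python standard library): the same double, written in little-endian and big-endian byte order, yields different bytes, and misreading one as the other silently corrupts the value.

import struct

x = 1.5
print(struct.pack('<d', x).hex())  # little-endian bytes: 000000000000f83f
print(struct.pack('>d', x).hex())  # big-endian bytes:    3ff8000000000000

# Reading little-endian bytes as if they were big-endian gives nonsense,
# a tiny denormal value instead of 1.5:
print(struct.unpack('>d', struct.pack('<d', x))[0])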

The NetCDF (Network Common Data Form) library solves these issues. (Alas, it is binary, and in the examples presented at http://www.unidata.ucar.edu/software/netcdf/examples/files.html, the binary CDF representations seem to be larger in size than the corresponding CDL text/metadata representations.)
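To get a feel for how this looks in practice, here is a minimal write/read round trip using the netCDF4 Python bindings; the variable and dimension names are illustrative only, not taken from the specification:

import netCDF4 as nc

# --- write ---
ds = nc.Dataset("density.nc", "w")            # create a new NetCDF file
ds.createDimension("number_of_grid_points", 8)
rho = ds.createVariable("density", "f8", ("number_of_grid_points",))
rho[:] = [0.1 * i for i in range(8)]          # store the numerical data
ds.title = "example density file"             # a global attribute
ds.close()

# --- read back, addressing the content by keyword ---
ds = nc.Dataset("density.nc")
print(ds.variables["density"][:])             # access by name, not by offset
ds.close()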

The idea of standardizing file formats is not new in the electronic structure community [X. Gonze, G. Zerah, K.W. Jakobsen, and K. Hinsen. Psi-k Newsletter 55, February 2003, pp. 129-134. URL: <http://psi-k.dl.ac.uk>]. (That article is a must-read, by the way: a very insightful and engaging overview of the well-intentioned developments in the atomistic/molecular software domain, purifying in its attempt to produce something good.) However, it proved difficult to achieve without a formal organization gathering code developers involved in different software projects, with a sufficient incentive to realize effective file exchange between such software.

HDF (Hierarchical Data Format) is also an alternative. NetCDF is simpler to use if the data formats are flat, while HDF has definite advantages if one is dealing with hierarchical formats. Typically, we will need to describe many different multi-dimensional arrays of real or complex numbers, for which NetCDF is an adequate tool.

Although a data specification might be presented irrespective of the library actually used for the implementation, such freedom might lead to implementations using incompatible file formats (like NetCDF and XML, for instance). This possibility would in effect nullify the expected gain from standardization. Thus, as a part of the standardization, we require future implementations of our specification to rely on the NetCDF library only. (So, the alternative to independent schemas is presented as a mandatory format, which runs against the whole idea of open development. The incompatibility between NetCDF and XML lies solely in the inflexibility/inextensibility of the former. Although such a predefined and ready-built format is advantageous for the huge numerical data and parameters that ab initio software uses, it is very immobile and unnecessary for compound and element data.)

The compact representation brought by NetCDF can be bypassed in favour of text encoding (very agreeable given the usage purposes: an XML-type extensible schema is much more adequate for structure/composition properties). We are aware of several standardization efforts [J. Junquera, M. Verstraete, X. Gonze, unpublished] and [J. Mortensen, F. Jollet, unpublished; also check the Minutes of the discussion on XML format for PAW setups], relying on XML, which emphasize addressing by content to represent such atomic data.

Four types of data are distinguished (a reading sketch follows the list):

  • (A) The actual numerical data (which defines whether a file contains wavefunctions, a density, etc), for which a name must have been agreed in the specification.
  • (B) The auxiliary data that is mandatory to make proper usage of the actual numerical data of A-type. The name and description of this auxiliary information is also agreed.
  • (C) The auxiliary data that is not mandatory to make proper usage of the A-type numerical data, but for which a name and description has been agreed in the specification.
  • (D) Other data, typically code-dependent, whose availability might help the use of the file for a specific code. The name of these variables should be different from the names chosen for agreed variables of A–C types. Such type D data might even be redundant with type A–C data.
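A sketch of what this distinction might mean for a reader of such files; the entry names here are made up for illustration, not taken from the actual specification:

import netCDF4 as nc

ds = nc.Dataset("wavefunctions.nc")

# Types A and B are mandatory: fail loudly if they are absent.
for name in ("coefficients_of_wavefunctions", "number_of_kpoints"):
    if name not in ds.variables and name not in ds.dimensions:
        raise KeyError(f"mandatory entry '{name}' missing: not a valid file")

# Types C and D are optional: use them when present, ignore them otherwise.
fermi_energy = None
if "fermi_energy" in ds.variables:
    fermi_energy = ds.variables["fermi_energy"][:]

ds.close()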

The NetCDF interface adapts the dimension ordering to the programming language used. The notation here is C-like, i.e. row-major storage, the last index varying the fastest. In FORTRAN, a similar memory mapping is obtained by reversing the order of the indices. (So, the ordering/reverse-ordering is handled by the interface/library)
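To make the ordering concrete, a small sketch (again with made-up names): a variable declared in C order as (slow, fast) is the same file content a Fortran code would declare with the indices reversed.

import numpy as np
import netCDF4 as nc

ds = nc.Dataset("order.nc", "w")
ds.createDimension("number_of_bands", 2)         # slowest-varying index in C
ds.createDimension("number_of_coefficients", 3)  # fastest-varying index in C
v = ds.createVariable("coeffs", "f8",
                      ("number_of_bands", "number_of_coefficients"))
v[:] = np.arange(6.0).reshape(2, 3)              # row-major layout on disk
ds.close()

# A Fortran reader would declare the same variable as
#   coeffs(number_of_coefficients, number_of_bands)
# i.e. with the index order reversed, and see the identical memory layout.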

Concluding Remarks

We presented the specifications for a file format relying on the NetCDF I/O library with content related to electronic structure and crystallographic data. This specification takes advantage of all the interesting properties of NetCDF-based files, in particular portability and extensibility. It is designed for both serial and distributed usage, although the latter characteristic was not presented here.

Several software packages in the Nanoquanta and ETSF [15] context can produce or read this file format: ABINIT [16] and [17], DP [18], GWST, SELF [19], V_Sim [20]. In order to further encourage its use, a library of Fortran routines [5] has been set up and is available under the GNU LGPL licence.

Additional Information:

Announcement for a past event
( http://www.tddft.org/pipermail/fsatom/2003-February/000004.html )

CECAM – psi-k – SIMU joint Tutorial

1) Title: Software solutions for data exchange and code gluing.

Location: Lyon
Dates: 8-10 October 2003

Purpose: In this tutorial, we will teach software tools and standards that have recently emerged for the exchange of data (text and binary) and the gluing of codes: (1) Python, as a scripting language, and its interfaces with C and FORTRAN; (2) XML, a standard for representing structured data in text files; (3) netCDF, a library and file format for the exchange and storage of binary data, and its interfaces with C, Fortran, and Python.

Organizers: X. Gonze (gonze@pcpm.ucl.ac.be)

K. Hinsen (hinsen@cnrs-orleans.fr)

2) Scientific content

Recent discussions, related to the CECAM workshop on "Open Source Software for Microscopic Simulations" (June 19-21, 2002), to the GRID concept (http://www.gridcomputing.com), as well as to the future Integrated Infrastructure Initiative proposal linked to the European psi-k network (http://psi-k.dl.ac.uk), have made it clear that one challenge for the coming years is the ability to establish standards for accessing codes, transferring data between codes, testing codes against each other, and becoming able to "glue" them (this being facilitated by the Free Software concept).

In the present tutorial, we would like to teach three "software solutions" to face this challenge : Python, XML and netCDF.

Python is now the de facto "scripting language" standard in the computational physics and chemistry community. XML (eXtensible Markup Language) is a framework for building mark-up languages, allowing one to set up self-describing documents, readable by humans and machines. netCDF allows binary files to be portable across platforms. It is not our aim to cover all possible solutions to the above-mentioned challenges (e.g. PERL, Tcl, or HDF), but these three have proven suitable for atomic-scale simulations, in the framework of leading projects like CAMPOS (http://www.fysik.dtu.dk/campos), MMTK (http://dirac.cnrs-orleans.fr/MMTK), and GROMACS (http://www.gromacs.org). Other software projects like ABINIT (http://www.abinit.org) and PWSCF (http://www.pwscf.org – in the DEMOCRITOS context), among others, have made their interest in these clear. All of these software solutions can be used without having to buy a licence.

Tentative program of the tutorial. Lectures in the morning, hands-on training in the afternoon.

1st day
  • 2h Python basics
  • 1h Interface: Python/C or FORTRAN
  • 1h XML basics
  • Afternoon: training with Python, and interfaces with C and FORTRAN

2nd day
  • 2h Python: object oriented (+ an application to GUI and Tk)
  • 1h Interface: Python/XML
  • 1h Interface: XML + C or FORTRAN
  • Afternoon: training with XML + interfaces

3rd day
  • 1h Python: numerical
  • 1h netCDF basics
  • 1h Interface: netCDF/Python
  • 1h Interface: netCDF/C or FORTRAN
  • Afternoon: training with netCDF + interfaces

3) List of lecturers

K. Hinsen (Orleans, France), organizer
X. Gonze (Louvain-la-Neuve, Belgium), organizer
K. Jakobsen (Lyngby, Denmark), instructor
J. Schiotz (Lyngby, Denmark), instructor
J. Van Der Spoel (Groningen, The Netherlands), instructor
M. van Loewis (Berlin, Germany), instructor

4) Number of participants: around 20. Most of the participants should be PhD students, postdocs, or young permanent scientists involved in code development. It is assumed that the attendants have a good knowledge of UNIX, and of C or FORTRAN.

Our budget will allow contributing to travel and local expenses of up to 20 participants.

XML and NetCDF:

from the Specification of file formats for NANOQUANTA/ETSF – www.etsf.eu/research/software/nq_specff_v2.1_final.pdf

Section 1. General considerations concerning the present file format specifications.

One has to consider the set of data to be included in each of the different types of files separately from their representation. Concerning the latter, one encounters simple text files, binary files, XML-structured files, NetCDF files, etc. It was already decided previously (Nanoquanta meeting, Maratea, Sept. 2004) to evolve towards formats that deal appropriately with the self-description issue, i.e. XML and NetCDF. The inherent flexibility of these representations will also allow specific versions of each type of file to evolve progressively, refining earlier working proposals. The same direction has been adopted by several groups of code developers that we know of.

Information on NetCDF and XML can be obtained from the official Web sites,

http://www.unidata.ucar.edu/software/netcdf/ and

http://www.w3.org/XML/

There are numerous other presentations of these formats on the Web, or in books.

The elaboration of file formats based on NetCDF advanced a lot during the Louvain-la-Neuve mini-workshop. There have also been some remarks about XML.

Concerning XML:

(A) The XML format is best adapted to the structured representation of relatively small quantities of data, as it is not compressed.

(B) It is a very flexible format, but hard to read in Fortran (no problem in C, C++ or Python). Recently, Alberto Garcia has set up an XMLF90 library of routines to read XML from Fortran: http://lcdx00.wm.lc.ehu.es/~wdpgaara/xml/index.html. Other efforts exist in this direction: http://nn-online.org/code/xml/
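As a taste of the kind of small, structured, human-readable data XML suits well, here is a toy example; the element and attribute names are invented for illustration and come from no actual ETSF schema:

import xml.etree.ElementTree as ET

doc = """
<crystal>
  <cell units="angstrom">5.43 5.43 5.43</cell>
  <atom species="Si" position="0.00 0.00 0.00"/>
  <atom species="Si" position="0.25 0.25 0.25"/>
</crystal>
"""

root = ET.fromstring(doc)
print(root.find("cell").text)        # the cell parameters, as plain text
for atom in root.findall("atom"):    # self-describing, human-readable
    print(atom.get("species"), atom.get("position"))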

Concerning NetCDF:

  • (A) Several groups of developers inside NQ already have good experience of using it for the representation of binary data (large files).
  • (B) Although there is no clear advantage of NetCDF compared to HDF (another possibility for large binary files), this experience inside the NQ network is the main reason for preferring it. By the way, NetCDF and HDF are willing to merge (this might take a few years, though).
  • (C) File size limitations of NetCDF exist (see appendix D), but should be overcome in the future.

Thanks to the flexibility of NetCDF, the content of a NetCDF file format suitable for use by NQ software might be of four different types:

(1) The actual numerical data (that defines a file for wavefunctions, or a density file, etc …), whose NetCDF description would have been agreed.

(2) The auxiliary data that are mandatory to make proper usage of the actual numerical data. The NetCDF description of these auxiliary data should also be agreed.

(3) The auxiliary data that are not mandatory, but whose NetCDF description has been agreed, in a larger context.

(4) Other data, typically code-dependent, whose existence might help the use of the file for a specific code.

References:
[5] URL: <http://www.etsf.eu/index.php?page=tools>.

[15] URL: <http://www.etsf.eu/>.

[16] X. Gonze, J.-M. Beuken, R. Caracas, F. Detraux, M. Fuchs, G.-M. Rignanese, L. Sindic, M. Verstraete, G. Zerah, F. Jollet, M. Torrent, A. Roy, M. Mikami, Ph. Ghosez, J.-Y. Raty and D.C. Allan, Comput. Mater. Sci. 25 (2002), pp. 478–492.

[17] X. Gonze, G.-M. Rignanese, M. Verstraete, J.-M. Beuken, Y. Pouillon, R. Caracas, F. Jollet, M. Torrent, G. Zérah, M. Mikami, Ph. Ghosez, M. Veithen, J.-Y. Raty, V. Olevano, F. Bruneval, L. Reining, R. Godby, G. Onida, D.R. Hamann and D.C. Allan, Zeit. Kristall. 220 (2005), pp. 558–562.

[18] URL: <http://dp-code.org>.

[19] URL: <http://www.bethe-salpeter.org>.

[20] URL: <http://www-drfmc.cea.fr/sp2m/L_Sim/V_Sim/index.en.html>.